• Research article
  • Open access
  • Published: 07 June 2021

Predicting the incidence of COVID-19 using data mining

  • Fatemeh Ahouz 1 &
  • Amin Golabpour   ORCID: orcid.org/0000-0001-7649-4033 2  

BMC Public Health volume 21, Article number: 1087 (2021)

The high prevalence of COVID-19 has made it a new pandemic. Predicting both its prevalence and incidence throughout the world is crucial to help health professionals make key decisions. In this study, we aim to predict the incidence of COVID-19 within a two-week period to better manage the disease.

The COVID-19 datasets provided by Johns Hopkins University contain information on COVID-19 cases in different geographic regions since January 22, 2020 and are updated daily. Data from 252 such regions were analyzed as of March 29, 2020, with 17,136 records and 4 variables, namely latitude, longitude, date, and cases. To model the incidence pattern for each geographic region, we used information on the region and its neighboring areas from the preceding 2 weeks. A model was then developed to predict the incidence for the coming 2 weeks using a Least-Squares Boosting (LSBoost) algorithm.

The model was built for three groups based on the incidence rate: fewer than 200, between 200 and 1000, and more than 1000 cases per day. The mean absolute errors on model evaluation were 4.71%, 8.54%, and 6.13%, respectively. In addition, comparing the forecasts with the actual values for the period in question showed that the proposed model predicted the number of globally confirmed cases of COVID-19 with an accuracy of 98.45%.

Using data from different geographical regions within a country and discovering the pattern of prevalence in a region and its neighboring areas, our boosting-based model was able to accurately predict the incidence of COVID-19 within a two-week period.

On December 8, 2019 the Chinese government reported the death of one patient and the hospitalization of 41 others with a disease of unknown etiology in Wuhan [ 1 ]. This cluster initiated the epidemic of the novel coronavirus respiratory disease (COVID-19). While the early cases were linked to the wet market, human-to-human transmission led to a widespread outbreak of the virus nationwide [ 2 ]. On January 30, 2020 the World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern (PHEIC) [ 3 ].

On the basis of the global spread and severity of the disease, on March 11, 2020 the Director-General of WHO officially declared the COVID-19 outbreak a pandemic [ 4 ]. The pandemic then entered a new stage, with rapid spread in countries outside China [ 5 ]. According to the 56th WHO situation report [ 6 ], as of March 16, 2020 the number of confirmed COVID-19 cases outside China exceeded the number inside. Consequently, after March 17, 2020 WHO began to report the number of confirmed cases and deaths on each continent rather than only inside and outside China.

According to the 70th WHO situation report [ 7 ], by March 30, 2020 the number of people infected with COVID-19 worldwide was 693,282, of whom 392,815 (about 57%) were in Europe, 142,081 (about 20%) in the Americas, 103,775 (about 15%) in the Western Pacific, 46,329 (about 7%) in the Eastern Mediterranean, 4084 (about 0.5%) in South-East Asia, and 3486 (about 0.5%) in Africa. Of that total, 33,106 died worldwide, 23,962 of whom (around 72% of all deaths) were in Europe, 3649 (around 11%) in the Western Pacific, and 5488 (around 17%) in other regions collectively.

Due to the growing prevalence of COVID-19 across the world, several works have examined different aspects of the disease. They involve identifying the source of the virus as well as analyzing its gene sequences [ 8 , 9 ], patient information [ 10 ], early cases in the countries infected [ 11 , 12 , 13 ], methods of virus detection [ 14 , 15 ], the epidemiological outbreak [ 16 , 17 ], and predicting COVID-19 cases [ 2 , 17 , 18 , 19 , 20 ].

In [ 18 ], using a heuristic method and WHO situation reports, an exponential curve was proposed to predict the number of cases in the next 2 weeks, up to March 30, 2020. The model was then tested against the 58th situation report, and the authors reported an error of 1.29%. Afterwards, on the assumption that the current trend would continue for the next 17 days, they predicted that by March 30, 1 million cases outside China would be reported in the 70th or 71st WHO situation report. Given that the number of confirmed cases outside China was 693,176 on March 30 [ 21 ], their forecast error was 44.26%.

In [ 17 ], the CoronaTracker team proposed a Susceptible-Exposed-Infectious-Recovered (SEIR) model based on data queried from their website, and made a 240-day prediction of COVID-19 cases in and out of China, starting on 20 January 2020. They predicted that the outbreak would reach its peak on May 23, 2020 and that the maximum number of infected individuals would amount to 425.066 million globally. In addition, the authors stated that this number would start to drop around early July 2020 and fall below 10,000 on 14 Sep 2020. Given the information available now, these predictions were far from what really happened around the world.

Elsewhere [ 19 ], the authors examined several available models to forecast cumulative cases 5 and 10 days ahead in Guangdong and Zhejiang up to February 23, 2020. They used the generalized logistic growth model, the Richards growth model, and a sub-epidemic wave model, which had previously been used to forecast other infectious disease outbreaks.

Although some works have proposed methods for predicting COVID-19 cases, to our knowledge at the time of writing none has been comprehensive, and none has predicted new cases in each geographical region as well as on each continent. In this study, using the COVID-19 cases dataset provided by Johns Hopkins University [ 22 ], we aim to predict the number of people infected with COVID-19 in each geographical region included in the dataset, as well as on each continent, in the coming 2-week period. Predicting the course of the current pandemic is crucial to containing the threat because it supports timely measures, e.g. equipping medical facilities, managing resource allocation, sending more personnel to high-risk areas, deciding whether to close borders or resume traffic, and suspending or resuming community services.

COVID-19 epidemiological data have been compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [ 22 ]. The data have been provided in three separate datasets for confirmed, recovered, and death cases since January 22, 2020 and are updated daily. In each of these datasets, there is a record (row) for every geographic region. The variables in each dataset are province/state, country/region, latitude, longitude, and one column for each successive date since January 22. For each region, the value for any date indicates the cumulative number of confirmed/recovered/death cases from January 22, 2020.

In this study, according to the input requirements of the proposed model, we changed the data representation so that instead of three separate datasets for three groups of confirmed, recovered, and death cases, only one dataset containing the information of all three groups was arranged. In this new dataset, each record (or row) of the dataset contains information about the number of confirmed, recovered, or deaths per day for each geographic region. As a result, the variables in this new dataset are: Province / State, Country / Region, Latitude (Lat), Longitude (Long), Date (specifying a certain date), Cases (indicating the number of confirmed, recovered, or death cases on the certain date), and Type (specifying the type of cases, i.e. confirmed, recovered, or death) as suggested by Rami Krispin [ 23 ].
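The change of representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `wide_to_long`, the ISO-formatted date keys, and the differencing of cumulative counts into daily counts are all assumptions made for the example.

```python
from datetime import date

ID_KEYS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_long(wide_rows, case_type):
    """Turn one wide row per region into one long record per region and day.

    wide_rows: list of dicts holding the four identifying keys plus one key
    per date (ISO strings such as "2020-01-22", an assumption of this sketch)
    whose value is the cumulative count up to that date.
    """
    long_rows = []
    for row in wide_rows:
        # ISO date strings sort chronologically when sorted as text.
        date_keys = sorted(k for k in row if k not in ID_KEYS)
        previous = 0
        for key in date_keys:
            cumulative = row[key]
            record = {k: row[k] for k in ID_KEYS}
            record["Date"] = date.fromisoformat(key)
            # Assumed: daily count is the difference of cumulative counts.
            record["Cases"] = cumulative - previous
            record["Type"] = case_type
            long_rows.append(record)
            previous = cumulative
    return long_rows
```

Calling the function once per source dataset (confirmed, recovered, death) and concatenating the results gives a single long table with the seven variables listed above. Under this differencing assumption, a downward correction in a cumulative series shows up as a negative daily count.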

In this study, data up to March 29, 2020 were analyzed, comprising 50,660 records and 7 variables. This period includes information about parts of winter and spring in the northern hemisphere and summer and autumn in the southern hemisphere. By March 29, the dataset consisted of cases from 177 countries and 252 different regions around the world. There were 720,139 confirmed, 33,925 death, and 149,082 recovered cases in the dataset.

Preprocessing step

Pre-processing was carried out on the dataset before training the proposed model. Figure  1 shows the preprocessing steps. The dataset was first examined for noise, defined as negative values in the Cases variable. The dataset contained 42 such negative values. After deleting these records, the number of records was reduced to 50,618.

figure 1

Preprocessing steps on COVID-19 dataset

Subsequently, the Date variable was converted to numerical format and renamed “Day”. To that end, January 22, 2020 was taken as the origin, and each subsequent date was expressed as its distance in days from that origin. As a result, January 22 and March 29 became Day 1 and Day 68, respectively.

Since each region is uniquely identified by its latitude and longitude, the data for Province/State and Country/Region were excluded from the dataset. Moreover, as the study aimed at predicting the incidence in any geographical region, we considered only those records providing information on the confirmed cases (17,179 records), and not those on deaths or recoveries. After keeping only the records with the value “Confirmed” in the Type variable, the Type variable itself was dropped from the dataset. In this study, “Cases” is the dependent variable.
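The preprocessing steps above can be summarized as follows. This is a sketch under stated assumptions: the study was implemented in MATLAB, so the Python function `preprocess` and the dictionary-based record layout are illustrative only.

```python
from datetime import date

ORIGIN = date(2020, 1, 22)  # taken as Day 1 in the text


def preprocess(records):
    """Apply the described steps to records with keys Lat, Long, Date, Cases, Type.

    Any other keys (e.g. Province/State, Country/Region) are simply dropped.
    """
    cleaned = []
    for r in records:
        if r["Cases"] < 0:  # negative counts are treated as noise
            continue
        if r["Type"].lower() != "confirmed":  # keep confirmed cases only
            continue
        cleaned.append({
            "Lat": r["Lat"],
            "Long": r["Long"],
            "Day": (r["Date"] - ORIGIN).days + 1,  # January 22 becomes Day 1
            "Cases": r["Cases"],
        })
    return cleaned
```

With this convention, a record dated March 29, 2020 receives Day 68, which matches the numbering given above.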

Constructing the prediction model

An ensemble method of regression learners was utilized to predict the incidence of COVID-19 in different regions. The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models called weak learners [ 24 ]. At every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously. One of the most commonly used loss functions is least-squares (LS) error [ 25 ].

In this study, the model employed a set of individual least-squares boosting (LSBoost) learners trying to minimize the mean squared error (MSE). The output of the model in step m, \( F_{m}(x) \), was calculated using Eq. 1:

\[ F_{m}(x) = F_{m-1}(x) + \rho_{m}\, h(x; a_{m}) \qquad (1) \]

where x is the input variable and h(x; a) is the parameterized function of x, characterized by parameters a [ 25 ]. The values of ρ and a were obtained from Eq. 2:

\[ (\rho_{m}, a_{m}) = \arg\min_{a,\rho} \sum_{i=1}^{N} \left[ \tilde{y}_{i} - \rho\, h(x_{i}; a) \right]^{2} \qquad (2) \]

where N is the number of training data and \( \tilde{y}_{i} = y_{i} - F_{m-1}(x_{i}) \) is the difference between the observed response and the aggregated prediction up to the previous step.
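To make the boosting update concrete, the following is a minimal sketch of least-squares boosting with one-split regression trees (stumps) as weak learners. It is not the MATLAB implementation used in the study. The stump learner, the `learning_rate` shrinkage factor, and all function names are assumptions of this example. Because each stump is itself fitted by least squares, the optimal ρ is already absorbed into its leaf values.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    if best is None:  # no split possible: fall back to a constant learner
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def ls_boost(X, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds it to the model."""
    f0 = sum(y) / len(y)  # initial constant prediction
    predictions = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
    return f0, stumps, learning_rate


def predict(model, row):
    f0, stumps, learning_rate = model
    return f0 + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

For example, `model = ls_boost(X, y)` followed by `predict(model, X[0])` returns the aggregated prediction for the first training row. Each additional round reduces the training MSE until the residuals can no longer be split.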

Due to the recent major changes in the incidence of COVID-19 worldwide over the past 2 weeks, we aimed to predict the number of new cases as an indicator of prevalence over the next 2 weeks. The structure of the proposed method is shown in Fig.  2 .

figure 2

The structure of the proposed model

Since the incubation period of COVID-19 can be up to 14 days [ 26 ], we assumed that at least 14 days of prior information was needed to predict the incidence of COVID-19 on a given day. Therefore, the proposed model examined all possible intervals between the first and the last 14 days to find the optimal time period whose information should be used to predict the number of cases in the coming days.

We hypothesized that the incidence in any region might follow the pattern of recent days in the same region and nearby. Therefore, after determining the optimal time period, the model added the information on confirmed cases in each region and nearby in the specified period to the same region’s record in the dataset.
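The idea of attaching recent counts from a region and its neighbors to each record can be sketched as below. This is a simplified illustration: the function name `build_features`, the nested-dictionary layout, and the use of 0 for days without a record are assumptions. The coordinate and Day variables that the full record also carries are omitted.

```python
def build_features(cases, neighbors, region, day, a, b):
    """Collect lagged confirmed-case counts for a region and its neighbors.

    cases:     dict mapping region -> dict mapping day number -> cases that day
    neighbors: dict mapping region -> list of neighboring regions, nearest first
    a, b:      nearest and farthest lag in days (a <= b), e.g. 14 and 17
    Returns the counts from `day - a` back to `day - b`, first for the region
    itself and then for each neighbor in turn.
    """
    lags = range(a, b + 1)
    features = [cases[region].get(day - lag, 0) for lag in lags]
    for other in neighbors[region]:
        features.extend(cases[other].get(day - lag, 0) for lag in lags)
    return features
```

With `a = 14` and `b = 17`, each region contributes four lagged counts, so a region with two neighbors yields twelve count features for every target day.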

After setting the time interval, [A, B], and the number of neighbors, the dataset was rearranged. In this case, the number of records was reduced from N to M, where M is calculated from Eq. 3 :

where R is the number of different regions in the dataset and B is the last day of the time period. Similarly, the number of variables stored for each record increased from the original 4 variables (latitude, longitude, Day, and Cases) to F, which is calculated from Eq. 4 :

where NN is the number of neighbors and 4 is the number of variables in the original dataset, because Lat, Long, Day, and Cases are stored for each geographical region. |B − A + 1| is the number of days within the period that contribute to the forecast of the next 14 days. The value of NN is multiplied by 2 because the latitude and longitude of each neighbor are added to the record. Furthermore, the Cases of each neighbor on each day within the period were added to the record, so NN was also multiplied by |B − A + 1|. For each region, the Day and Cases data during the period were added as well, so |B − A + 1| was multiplied by 2. It should be noted, however, that the dependent variable remained the Cases of the current day.
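One reading of the record and variable counts that is consistent with the verbal description above is sketched here. The exact forms of Eq. 3 and Eq. 4 are not reproduced in this text, so both formulas in the code are assumptions inferred from the surrounding explanation. The function name `rearranged_shape` is hypothetical.

```python
def rearranged_shape(n_records, n_regions, a, b, n_neighbors):
    """Return (M, F) for the rearranged dataset under the assumed formulas.

    Assumed Eq. 3: M = N - R * B
        (the first B days of every region lack a full history)
    Assumed Eq. 4: F = 4 + 2*NN + NN*|B - A + 1| + 2*|B - A + 1|
    """
    window = abs(b - a + 1)
    m = n_records - n_regions * b
    f = 4 + 2 * n_neighbors + n_neighbors * window + 2 * window
    return m, f


# Illustration: 252 regions x 68 days = 17,136 records, lags 14 to 17, 2 neighbors.
m, f = rearranged_shape(17_136, 252, 14, 17, 2)
print(m, f)
```

Under these assumptions the example prints 12852 records and 24 variables per record.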

Since the number of both the nearby regions and the previous days effective in forecasting were unknown, we assumed these values to be unknown variables and obtained the most accurate model by examining all possible combinations of such variables in an iterative process.
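The iterative search over neighborhood sizes and time intervals can be sketched as an exhaustive enumeration. The ranges used as defaults (0 to 10 neighbors, days 14 to 54) are those reported in the Model construction section. The function names and the `score` callback are assumptions, with `score` standing in for training a model and returning its validation error.

```python
from itertools import product


def candidate_settings(max_neighbors=10, nearest_day=14, farthest_day=54):
    """Yield every (NN, A, B) with 0 <= NN <= max_neighbors and A <= B."""
    days = range(nearest_day, farthest_day + 1)
    for nn, a in product(range(max_neighbors + 1), days):
        for b in range(a, farthest_day + 1):
            yield nn, a, b


def best_setting(score):
    """Return the setting with the lowest score (e.g. validation MAE)."""
    return min(candidate_settings(), key=score)
```

With the default ranges there are 11 neighborhood sizes and 861 intervals, i.e. 9471 candidate settings, so an exhaustive search remains feasible provided each individual model is cheap to train.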

The accuracy of the model was evaluated in terms of Mean Squared Error (MSE) and Mean Absolute Error (MAE). Because MAE was normalized to [0, 1], the evaluation error is equal to 2 times MAE. For evaluation, the data from the last 2 weeks for all regions were held out as a validation set, and the model was trained on the remaining data in the dataset.
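The time-based hold-out and the two error measures can be sketched as follows. The function names and the record layout with a `Day` key are assumptions of this example. The normalization of MAE mentioned above is not reproduced here because its exact form is not given.

```python
def split_last_days(records, n_days=14):
    """Hold out the last n_days of data (all regions) as the validation set."""
    last_day = max(r["Day"] for r in records)
    cutoff = last_day - n_days
    train = [r for r in records if r["Day"] <= cutoff]
    valid = [r for r in records if r["Day"] > cutoff]
    return train, valid


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

For a dataset covering Day 1 to Day 68, `split_last_days` keeps Days 1 to 54 for training and Days 55 to 68 for validation.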

Forecast incidence in the next 2 weeks

A new test set was created to predict incidence in the next 2 weeks (by April 12, 2020). The number of records in this dataset was equal to that of unique geographical regions in the COVID-19 dataset. Then, according to the best neighborhood and optimal time interval specified in the previous step, the necessary features were provided for each record. After that, the best model created in the previous step was retrained on the entire dataset as a training set. This model was then applied to the new test set to predict the incidence rate.

Evaluating the actual performance of the proposed model

Given that the actual number of confirmed cases within March 30–April 12, 2020 period was available at the time of review, the performance of the proposed model was measured based on percent error between the predicted and the actual values. The percent error was calculated from Eq. 5 :

Where δ is percent error, v A is the actual observed value and v E is the expected (predicted) value. Furthermore, according to the predicted and actual confirmed cases in 252 geographical regions in the dataset, the continental incidence rate was calculated using Eq. 6 :

where I C is the incidence in each continent and I W is the global incidence of COVID-19 from March 30 to April 12, 2020.
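A percent-error calculation of the kind described above can be sketched as below. The denominator is an assumption: the error is taken relative to the actual value, which is the convention that reproduces the 44.26% forecast error quoted earlier for the heuristic model of [ 18 ] (1,000,000 predicted against 693,176 observed). The function names are hypothetical.

```python
def percent_error(actual, predicted):
    """Absolute error relative to the actual value, in percent (assumed form)."""
    return abs(actual - predicted) / abs(actual) * 100


def accuracy(actual, predicted):
    """Accuracy reported as 100 minus the percent error."""
    return 100 - percent_error(actual, predicted)


# Check against the figure quoted for reference [18].
print(round(percent_error(693_176, 1_000_000), 2))
```

The check prints 44.26, in agreement with the value given in the text.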

The experiments were run on an Intel® Core™ i7-8550U CPU @ 1.80 GHz with 12.0 GB of RAM, running a 64-bit version of MS Windows. The pre-processing and model construction were implemented in MATLAB.

Model construction

The number of neighbors ranged from zero to 10. The upper limit of 10 was obtained by trial and error. Euclidean distance based on latitude and longitude was used to find the nearest neighbors. Given that the dataset contains data from January 22, 2020 to March 29, 2020, the nearest and farthest days before the day whose incidence is to be predicted were set to 14 and 54, respectively. Because the number of confirmed cases varies greatly from region to region, the proposed algorithm was implemented for 3 different groups of regions: those with fewer than 200 confirmed cases per day (16,825 records), those with 200 to 1000 cases per day (220 records), and those with more than 1000 cases per day (152 records).
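The neighbor selection and the grouping by daily cases can be sketched as follows. The distance is the plain Euclidean distance on latitude and longitude in degrees, as stated above. The function names, the dictionary layout, and the handling of the boundary values 200 and 1000 are assumptions of this example.

```python
import math


def nearest_neighbors(coords, region, k):
    """Return the k regions closest to `region`.

    coords: dict mapping region -> (latitude, longitude) in degrees.
    Distances are Euclidean in coordinate space, not great-circle distances.
    """
    lat0, lon0 = coords[region]
    others = [r for r in coords if r != region]
    others.sort(key=lambda r: math.hypot(coords[r][0] - lat0, coords[r][1] - lon0))
    return others[:k]


def incidence_group(cases_per_day):
    """Assign a record to one of the three groups (boundary handling assumed)."""
    if cases_per_day < 200:
        return "below 200"
    if cases_per_day <= 1000:
        return "200 to 1000"
    return "above 1000"
```

Treating degrees of latitude and longitude as planar coordinates is a simplification: it ignores the curvature of the Earth and the wrap-around at ±180° longitude, so neighbors near the date line or the poles may be ranked differently than by true distance.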

Table  1 shows the results of the best proposed model for different combinations of neighborhood size and preceding days. For regions with more than 1000 confirmed cases per day, the proposed model performed best, with an MAE of 6.13%, when using the information from the last 14 to 17 days for the region and its two neighboring areas. In the dataset, the number of cases recorded in these regions varied from 1019 to 19,821.

For regions with 200 to 1000 cases per day, the proposed model performed best with the 9 nearest neighboring areas and data from the last 14 to 20 days, with an MAE of 8.54% on the validation set. For regions with fewer than 200 cases per day, on the other hand, the proposed model performed best, with an MAE of 4.71%, when using the region data for the last 14 to 34 days.

Prediction of incidence by April 12, 2020

Figure  3 shows the prevalence of the COVID-19 from the first week to the tenth week in different regions, based on the information provided by the COVID-19 epidemiological dataset [ 22 ]. In this Figure, the diameter of the circles is proportional to the prevalence in those regions and the center of each circle matches the geographical coordinates of the region.

figure 3

Visualization of the outbreak over time (created by the authors using GIMP, open-source software)

Table  2 shows the forecast number of new cases per day on different continents. Given the location of the continents in the northern and southern hemispheres, the period in question covers winter and early spring in North America, Europe, and almost all of Asia. It covers summer and part of autumn in Australia and most of South America. Given that Africa lies in all four hemispheres, the data recorded for this continent in this period include all seasons.

By April 12, 1,134,018 new cases were expected to be recorded worldwide. Of these, Europe with 687,665 (60.64%), North America with 272,957 (24.07%), and Asia with 107,000 (9.44%) new cases accounted for the largest shares, whereas Australia with 14,526 (1.28%), Africa with 19,131 (1.69%), and South America with 32,739 (2.89%) new cases had the lowest incidence. Africa, Europe, and South America had the highest rates of COVID-19 incidence, with 283%, 221.23%, and 178.87%, respectively. Asia was the only continent where growth had slowed, with an incidence rate of −34%.
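The continental shares quoted above can be checked with a few lines of arithmetic. The counts are taken from the text, reading the South American figure as 32,739. The variable names are illustrative.

```python
predicted_new_cases = {
    "Europe": 687_665,
    "North America": 272_957,
    "Asia": 107_000,
    "South America": 32_739,
    "Africa": 19_131,
    "Australia": 14_526,
}

total = sum(predicted_new_cases.values())
shares = {
    continent: round(100 * count / total, 2)
    for continent, count in predicted_new_cases.items()
}

print(total)
for continent, share in shares.items():
    print(f"{continent}: {share}%")
```

The six continental counts sum to the stated global total of 1,134,018, and the computed shares agree with the percentages reported in the text.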

Figure  4 shows the prediction of incidence rates in different regions. Accordingly, the prevalence would decrease over the next 2 weeks in the Middle East, yet it would increase in North America and Europe. Outbreak forecasts for 244 geographic regions are provided in Additional file  1 : Appendix 1.

figure 4

Prediction of the incidence in weeks 10 and 11 (created by the authors using GIMP, open-source software)

Comparison of predicted and actual cases from March 30 to April 12, 2020

Table  3 shows the total number of daily cases in the 252 regions surveyed between March 30 and April 12, 2020. As shown, the daily percent error is below 20%. The best accuracy of the proposed model in predicting the incidence of COVID-19 was obtained on April 10 with 99.6%, and the worst on April 11 with 81.3%. The two-week continental incidence rates are also shown in Fig.  5 . The best predicted continental incidence rates were found for South America and Asia, with percent errors of 18.15% and 21.04%, respectively. The worst were observed for Africa and Australia, with percent errors of more than 80%.

figure 5

Comparison of predicted and actual continental incidence rates between March 30 and April 12, 2020

Data mining is capable of producing predictive models and extracting new knowledge from retrospective data. The way data are processed, as well as the variables selected, has a significant impact on knowledge discovery. Various data mining techniques are used to predict outbreaks. As a global health concern, COVID-19 had already developed into one of the world’s major emergencies. The present study set out to investigate its outbreak worldwide during a two-week period via a predictive model based on retrospective data. It was concluded that such a model could be built with acceptable error rates.

The study made use of a coronavirus dataset to design a COVID-19 incidence prediction model. According to the incidence rate per day, the model was trained for three groups: below 200, 200–1000, and above 1000 cases. One-way ANOVA results showed that there was a statistically significant difference between the prevalence rates in the three groups ( p -value < 0.001). For each group, the prediction model was implemented and the incidence was predicted for the next 2 weeks. The proposed model achieved about 10% error (90% accuracy) for the group of fewer than 200 cases, 18% error (82% accuracy) for the group of 200–1000 cases, and 13% error (87% accuracy) for the group exceeding 1000 cases.

In this study, the incidence of COVID-19 was evaluated for 68 days worldwide, and a prediction model was presented for the following two-week period (i.e., March 30–April 12, 2020). More than 1,000,000 people were expected to contract the disease within those 2 weeks, an increase of 58% compared with the 700,000 cases of the outbreak up to March 29, 2020.

The study found that adjacent regions with a prevalence of less than 1000 had similar incidence, so the incidence of each of these regions could be determined from information on neighboring areas. The use of neighborhood information enables the model to indirectly consider the effective policies of other regions in predicting the incidence of COVID-19 in each region.

Given that the proposed model was trained using only 68-day data (which was the most up-to-date information at the time of writing), an accuracy above 81% in predicting the incidence was deemed acceptable for such an unknown disease. Further, according to the results shown in Table 3 , the model prediction error for a total of 12 days for 252 regions was less than 2%. Therefore, if the data of each country were stored more precisely using more geographical regions, it would be promising to create an accurate model for predicting the incidence of COVID-19 over a two-week period in each country. While many unknowns are to be expected with a new pandemic, having this information can guide planning and resource allocation for prevention, treatment, and palliative care.

Although time series usually need to be long enough (normally a few years) to adequately account for seasonality, based on the results of the model implementations, we believe that this model, even with that short a time series, is able to manage seasonality and can predict the number of cases with acceptable accuracy (see Additional file 1 : Appendices 2 and 3 for the results of all analyses). However, it is suggested that future research specifically address the effect of seasonal changes on the prevalence of this disease.

One of the limitations of the study was that the dataset did not provide sufficient information from all continents. Given that the disease did not occur simultaneously on all continents, and the continental prevalence was in most cases after the 40th day of the first case in China, 68 days of data did not seem sufficient to predict the prevalence of such an unknown disease.

In Africa, the first case was reported on or after the 50th day in more than 80% of the 45 geographical regions. The number of confirmed cases since then was 4682, which was 97.83% of the total 4786 confirmed cases in Africa. In Australia, the first case was reported from the 40th day onwards in more than 45% of the 11 geographical regions. Also, out of a total of 4504 cases on the continent, 4478 cases (99.4%) were confirmed from that day onwards.

In Europe, the first case was reported in 60 of the 69 geographic regions in the dataset from the 40th day onwards. Out of a total of 385,735 cases, information on 384,268 cases (i.e. 99.62%) has also been entered since that day. Similarly, South America confirmed its first case after the 40th day in 16 out of 17 regions. It is noteworthy that out of a total of 11,642 cases, 11,542 (99.14%) were confirmed from day 50 onwards.

In contrast, 88% of the North American regions had their first cases confirmed since day 50. In addition, of the 46 confirmed cases by March 29, 2020 on the continent, 38 were reported since day 50 (82.61%) and 41 were confirmed from day 40 onwards (89.13%).

Because information on some continents was insufficient, as the disease reached them later than the declared beginning of the outbreak, the effects of measures such as increasing the number of tests taken per day and quarantine restrictions on some continents such as Europe, which were in place from March 30 to April 12, were not reflected in the dataset.

Nevertheless, the inaccurate prediction of the number of cases in Africa could be attributed, in turn, to the insufficient information about the continent in the dataset. In 80% of the African regions, the first confirmed case was recorded 50 days into the outbreak. Out of a total of 4786 cases there, up until the 68th day, 4682 cases (more than 97%) were reported since day 50.

In addition, since latitude and longitude are two important indicators in the dataset, the non-uniformity with which this information is recorded for different geographical regions is another limitation of the work: for some areas the information refers to one state of a country, and for others to the whole country. For example, in the dataset for USA, all cases are provided in terms of only one latitude and longitude, but for Netherlands, the data of COVID-19 cases are provided for four different latitude and longitude pairs.

Another limitation of this study was the use of data from all countries coping with COVID-19, each with its own protocols for testing and identifying patients. However, in general, this is the only global dataset for COVID-19, and it has been used in other studies [ 16 , 17 ]. Moreover, early information on each country was taken into account in the proposed model when predicting the incidence in that country, in order to reduce this limitation.

It is worth noting that the model rests on both the information provided by the dataset and the current measures taken in dealing with the disease. Hence, if governments’ policies to tackle the disease change, so will the accuracy of the information.


Since epidemiological models such as SIR failed to accurately predict COVID-19 cases, as stated in [ 17 , 27 , 28 ], the current study relied on data from January 22 to March 29 provided by Johns Hopkins University and proposed a more complex model based on machine learning methods. The mean absolute error of the proposed model was 6.13% in predicting the incidence of COVID-19 in the two-week period of March 16–29 for regions with more than 1000 cases per day. The MAE was 8.54% and 4.71% for regions with a daily incidence rate between 200 and 1000 cases and fewer than 200 cases, respectively. An accuracy of more than 82% on the evaluation set confirms our perception that the pattern of incidence of a region is influenced by the pattern of disease in recent days in the same region and neighboring areas.

Last but not least, despite numerous limitations of the dataset, lack of knowledge about such an unknown disease, and changes in disease control policies in different countries during the period under scrutiny, the proposed model proved effective in predicting the global incidence of COVID-19 in the two-week period from March 30 to April 12 with 98.45% accuracy. In addition, the accuracy of the proposed model in predicting daily cases in a worst-case scenario was 81.31%.

The model is written in a general form and can be run for different intervals (see Additional file 1 : Appendix 4). It is suggested that the model be implemented for future data as well.

Availability of data and materials

The dataset analyzed during the current study is public and is available at https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases and at https://codeload.github.com/RamiKrispin/coronavirus-csv/zip/master .


WHO: World Health Organization

PHEIC: Public Health Emergency of International Concern

JHU CSSE: Johns Hopkins University Center for Systems Science and Engineering

LSBoost: Least-squares boosting

MSE: Mean Squared Error

MAE: Mean Absolute Error

Nkengasong J. Author Correction: China’s response to a novel coronavirus stands in stark contrast to the 2002 SARS outbreak response. Nat Med. 2020;26(3):441. https://doi.org/10.1038/s41591-020-0816-5 .

Roosa K, Lee Y, Luo R, Kirpich A, Rothenberg R, Hyman JM, et al. Real-time forecasts of the COVID-19 epidemic in China from February 5th to February 24th, 2020. Infect Dis Model. 2020;5:256–63. https://doi.org/10.1016/j.idm.2020.02.002 .

Eurosurveillance Editorial Team. Note from the editors: World Health Organization declares novel coronavirus (2019-nCoV) sixth public health emergency of international concern. Eurosurveillance. 2020;25(5):2–3.

World Health Organization, WHO Director-General's opening remarks at the media briefing on COVID-19 - 11 March 2020. 2020. Available from: https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19%2D%2D-11-march-2020 . Accessed 27 May 2021.

Bedford J, et al. COVID-19: towards controlling of a pandemic . 2020.

World Health Organization. Coronavirus disease 2019 (COVID-19) Situation Report −60. 2020.

World Health Organization. Coronavirus disease 2019 (COVID-19) Situation Report −70. 2020 [updated 19 March 2020]. Available from: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_4 . Accessed 27 May 2021.

Ji W, Wang W, Zhao X, Zai J, Li X. Cross-species transmission of the newly identified coronavirus 2019-nCoV. J Med Virol. 2020;92(4):433–40. https://doi.org/10.1002/jmv.25682 .

Paraskevis D, Kostaki EG, Magiorkinis G, Panayiotakopoulos G, Sourvinos G, Tsiodras S. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infect Genet Evol. 2020;79:104212. https://doi.org/10.1016/j.meegid.2020.104212 .

Huang C, Wang Y, Li X. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China (vol 395, pg 497, 2020). Lancet. 2020;395(10223):496.

Kim JY, Choe PG, Oh Y, Oh KJ, Kim J, Park SJ, et al. The first case of 2019 novel coronavirus pneumonia imported into Korea from Wuhan, China: implication for infection prevention and control measures. J Korean Med Sci. 2020;35(5):e61.  https://doi.org/10.3346/jkms.2020.35.e61 .

Bernard Stoecklin S, Rolland P, Silue Y, Mailles A, Campese C, Simondon A, et al. First cases of coronavirus disease 2019 (COVID-19) in France: surveillance, investigations and control measures, January 2020. Euro Surveill. 2020;25(6):2000094. https://doi.org/10.2807/1560-7917.ES.2020.25.6.2000094 .

Giovanetti M, Benvenuto D, Angeletti S, Ciccozzi M. The first two cases of 2019-nCoV in Italy: Where they come from? J Med Virol. 2020;92(5):518–21. https://doi.org/10.1002/jmv.25699 .

Corman VM, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020;25(3):23–30.

Zhang NR, et al. Recent advances in the detection of respiratory virus infection in humans. J Med Virol. 2020;92(4):408–17. https://doi.org/10.1002/jmv.25674 .

Dey SK, Rahman MM, Siddiqi UR, Howlader A. Analyzing the epidemiological outbreak of COVID-19: a visual exploratory data analysis approach. J Med Virol. 2020;92(6):632–8. https://doi.org/10.1002/jmv.25743 .

Binti Hamzah FA, et al. CoronaTracker: world-wide COVID-19 outbreak data analysis and prediction . 2020.

Koczkodaj WW, Mansournia MA, Pedrycz W, Wolny-Dominiak A, Zabrodskii PF, Strzałka D, et al. 1,000,000 cases of COVID-19 outside of China: The date predicted by a simple heuristic. Glob Epidemiol. 2020;2:100023. https://doi.org/10.1016/j.gloepi.2020.100023 .

Roosa K, Lee Y, Luo R, Kirpich A, Rothenberg R, Hyman JM, et al. Short-term Forecasts of the COVID-19 Epidemic in Guangdong and Zhejiang, China: February 13–23, 2020. J Clin Med. 2020;9(2):596. https://doi.org/10.3390/jcm9020596 .

Nishiura H, Jung SM, Linton NM, Kinoshita R, Yang YC, Hayashi K, et al. The extent of transmission of novel coronavirus in Wuhan, China, 2020. J Clin Med. 2020;9(2):330. https://doi.org/10.3390/jcm9020330 .

World Health Organization. Coronavirus disease 2019 (COVID-19) Situation Report – 70. 2020. Available from: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_4 .

Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). Novel Coronavirus (COVID-19) Cases Data. 2020. Available from: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases .

Krispin R. Coronavirus. 2020. Available from: https://github.com/RamiKrispin/coronavirus .

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, second edition. Springer Series in Statistics. New York: Springer-Verlag; 2008.

Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232. https://doi.org/10.1214/aos/1013203451 .

World Health Organization. Transmission of SARS-CoV-2: implications for infection prevention precautions. 2020. Available from: https://www.who.int/news-room/commentaries/detail/transmission-of-sars-cov-2-implications-for-infection-prevention-precautions#:~:text=The%20incubation%20period%20of%20COVID,to%20a%20confirmed%20case .

Postnikov EB. Estimation of COVID-19 dynamics “on a back-of-envelope”: Does the simplest SIR model provide quantitative parameters and predictions? Chaos, Solitons Fractals. 2020;135:109841. https://doi.org/10.1016/j.chaos.2020.109841 .

Cooper I, Mondal A, Antonopoulos CG. A SIR model assumption for the spread of COVID-19 in different communities. Chaos, Solitons Fractals. 2020;139:110057.


The authors thank the Deputy of Research and Technology of Khatam Alanbia University of Technology.

Not applicable.

Author information

Authors and affiliations.

Department of Computer Engineering, School of Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran

Fatemeh Ahouz

School of Medicine, Shahroud University of Medical Sciences, Shahroud, Iran

Amin Golabpour


FA and AG contributed equally to the conception and design of the work and to the analysis and interpretation of data. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Amin Golabpour.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Appendix 1. Point-to-point forecast for all areas in the dataset. Appendix 2. Investigation of the effect of seasonal changes on model performance. Appendix 3. The performance of the proposed method on randomly selected regions. Appendix 4. The results of the proposed method on the updated data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Ahouz, F., Golabpour, A. Predicting the incidence of COVID-19 using data mining. BMC Public Health 21, 1087 (2021). https://doi.org/10.1186/s12889-021-11058-3

Received: 03 April 2020

Accepted: 13 May 2021

Published: 07 June 2021

DOI: https://doi.org/10.1186/s12889-021-11058-3

  • Data mining

BMC Public Health

ISSN: 1471-2458

Insights from the COVID-19 Pandemic: A Survey of Data Mining and Beyond

  • Published: 22 June 2024

  • Imad Afyouni 1 ,
  • Ibrahim Hashim 1 ,
  • Zaher Aghbari 1 ,
  • Tarek Elsaka 2 ,
  • Mothanna Almahmoud 3 &
  • Laith Abualigah 4 , 5 , 6 , 7  

58 Accesses

The global health crisis of COVID-19 has ushered in an era of unprecedented data generation, encompassing the virus’s transmission patterns, societal consequences, and governmental responses. Data mining has emerged as a pivotal tool for extracting invaluable insights from this voluminous dataset, offering critical support for informed decision-making. While existing surveys primarily explore methodologies for detecting COVID-19 in medical imagery and official sources, this article comprehensively examines the pandemic through big data mining. We emphasize the significance of social network analysis, shedding light on the pandemic’s profound influence on community socio-economic behavior. Additionally, we illuminate advancements in diverse domains, encompassing behavioral impact analysis on social media, contact tracing implications, early disease screening through medical imaging, and insights derived from health-related time-series data analytics. Our study further organizes the literature by categorizing it based on data sources, dataset types, analytical approaches, techniques, and application scenarios. Finally, we delineate prevailing challenges and forthcoming research prospects, charting the course for future investigations.

Data availability statement.

Data is available from the authors upon reasonable request.


Abdalla, W., Renukappa, S., & Suresh, S. (2023). Managing covid-19-related knowledge: A smart cities perspective. Knowledge and Process Management, 30 (1), 87–109.

Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M., & Shah, Z. (2020). Top concerns of tweeters during the covid-19 pandemic: Infoveillance study. Journal of Medical Internet Research, 22 (4), e19016.

Abdul-Mageed, M., & Diab, M. T. (2011). Subjectivity and sentiment annotation of modern standard arabic newswire. In: Proceedings of the 5th linguistic annotation workshop , pp. 110–118.

Abdul-Mageed, M., & Diab, M. (2014). SANA: A large scale multi-genre, multi-dialect lexicon for Arabic subjectivity and sentiment analysis. In: Proceedings of the ninth international conference on Language Resources and Evaluation (LREC’14) , European Language Resources Association (ELRA), Reykjavik, Iceland, pp. 1162–1169.

Abuhammad, S., Khabour, O. F., & Alzoubi, K. H. (2020). Covid-19 contact-tracing technology: Acceptability and ethical issues of use. Patient Preference and Adherence, 14 , 1639.

Adly, A. S., Adly, A. S., & Adly, M. S. (2020). Approaches based on artificial intelligence and the internet of intelligent things to prevent the spread of covid-19: Scoping review. Journal of Medical Internet Research, 22 (8), e19104.

Agarwal, A., Salehundam, P., Padhee, S., Romine, W. L., & Banerjee, T. (2020). Leveraging natural language processing to mine issues on twitter during the covid-19 pandemic. arXiv:2011.00377

Ahmed, N., Michelin, R. A., Xue, W., Ruj, S., Malaney, R., Kanhere, S. S., Seneviratne, A., Hu, W., Janicke, H., & Jha, S. K. (2020). A survey of covid-19 contact tracing apps. IEEE Access, 8 , 134577–134601.

Ajaz, F., Naseem, M., Sharma, S., Shabaz, M., & Dhiman, G. (2022). Covid-19: Challenges and its technological solutions using iot. Current Medical Imaging, 18 (2), 113–123.

Alamoodi, A., Zaidan, B., Zaidan, A., Albahri, O., Mohammed, K., Malik, R., Almahdi, E., Chyad, M., Tareq, Z., Albahri, A., et al. (2020). Sentiment analysis and its applications in fighting covid-19 and infectious diseases: A systematic review. Expert Systems with Applications , 114155.

Alanazi, E., Alashaikh, A., Alqurashi, S., & Alanazi, A. (2020). Identifying and ranking common covid-19 symptoms from tweets in Arabic: Content analysis. Journal of Medical Internet Research, 22 (11), e21329.

Alarabi, L., Basalamah, S., Hendawi, A., Abdalla, M. (2021). Traceall: A real-time processing for contact tracing using indoor trajectories. Information , 12 (5). https://doi.org/10.3390/info12050202 , https://www.mdpi.com/2078-2489/12/5/202

Alelyani, M., Alghamdi, A., Shubayr, N., Alashban, Y., Almater, H., Alamri, S., & Alghamdi, A. J. (2021). The impact of the covid-19 pandemic on medical imaging case volumes in aseer region: A retrospective study. Medicines, 8 (11), 70.

Alqurashi, S., Alhindi, A., & Alanazi, E. (2020). Large arabic twitter dataset on covid-19. arXiv:2004.04315

Al-Rawi, A., & Shukla, V. (2020). Bots as active news promoters: A digital analysis of covid-19 tweets. Information, 11 (10), 461.

Alsudias, L., & Rayson, P. (2020). COVID-19 and Arabic Twitter: How can Arab world governments and public health organizations learn from social media? In: Proceedings of the 1st workshop on NLP for COVID-19 at ACL 2020 , Association for Computational Linguistics, Online. https://www.aclweb.org/anthology/2020.nlpcovid19-acl.16

Alzahrani, S. I., Aljamaan, I. A., & Al-Fakih, E. A. (2020). Forecasting the spread of the covid-19 pandemic in Saudi Arabia using arima prediction model under current public health interventions. Journal of Infection and Public Health, 13 (7), 914–919.

Amram, O., Amiri, S., Lutz, R. B., Rajan, B., & Monsivais, P. (2020). Development of a vulnerability index for diagnosis with the novel coronavirus, covid-19, in Washington State, USA. Health & Place .

Anastassopoulou, C., Russo, L., Tsakris, A., & Siettos, C. (2020). Data-based analysis, modelling and forecasting of the covid-19 outbreak. PloS one, 15 (3), e0230405.

Annas, S., Pratama, M. I., Rifandi, M., Sanusi, W., & Side, S. (2020). Stability analysis and numerical simulation of seir model for pandemic covid-19 spread in Indonesia. Chaos, Solitons & Fractals, 139 , 110072.

Anshari, M., Hamdan, M., Ahmad, N., Ali, E., & Haidi, H. (2023). Covid-19, artificial intelligence, ethical challenges and policy implications. Ai & Society, 38 (2), 707–720.

Apuke, O. D., & Omar, B. (2021). Fake news and covid-19: Modelling the predictors of fake news sharing among social media users. Telematics and Informatics, 56 , 101475.

Arunmozhi, M., Persis, J., Sreedharan, V. R., Chakraborty, A., Zouadi, T., & Khamlichi, H. (2022). Managing the resource allocation for the covid-19 pandemic in healthcare institutions: A pluralistic perspective. International Journal of Quality & Reliability Management, 39 (9), 2184–2204.

Ayoub, J., Yang, X. J., & Zhou, F. (2021). Combat covid-19 infodemic using explainable natural language processing models. Information Processing & Management , 58 (4), 102569. https://doi.org/10.1016/j.ipm.2021.102569 , https://www.sciencedirect.com/science/article/pii/S0306457321000704

Aytaç, U. C., Güneş, A., & Ajlouni, N. (2022). A novel adaptive momentum method for medical image classification using convolutional neural network. BMC Medical Imaging, 22 (1), 1–12.

Bahja, M., Hammad, R., Kuhail, M. A. (2020). Capturing public concerns about coronavirus using arabic tweets: An nlp-driven approach. In: 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), IEEE , pp. 310–315.

Bayham, J., & Fenichel, E. P. (2020). The impact of school closure for covid-19 on the us healthcare workforce and the net mortality effects. Available at SSRN 3555259.

Beare, B. K., & Toda, A. A. (2020). On the emergence of a power law in the distribution of covid-19 cases. Physica D: Nonlinear Phenomena, 412 , 132649.

Bentotahewa, V., Hewage, C., & Williams, J. (2021). Solutions to big data privacy and security challenges associated with covid-19 surveillance systems. Frontiers in Big Data, 4 , 645204.

Bhattacharjee, S. (2020). Statistical investigation of relationship between spread of coronavirus disease (covid-19) and environmental factors based on study of four mostly affected places of China and five mostly affected places of Italy. arXiv:2003.11277

Bhattacharya, S., Maddikunta, P. K. R., Pham, Q.-V., Gadekallu, T. R., Chowdhary, C. L., Alazab, M., Piran, M. J., et al. (2021). Deep learning and medical image processing for coronavirus (covid-19) pandemic: A survey. Sustainable Cities and Society, 65 , 102589.

Born, J., Beymer, D., Rajan, D., Coy, A., Mukherjee, V. V., Manica, M., Prasanna, P., Ballah, D., Guindy, M., Shaham, D. et al. (2021). On the role of artificial intelligence in medical imaging of covid-19. Patterns , 2 (6).

Boyle, F., & Sherman, D. (2006). Scopus ™: The product and its development. The Serials Librarian, 49 (3), 147–153.

Bradshaw, W. J., Alley, E. C., Huggins, J. H., Lloyd, A. L., & Esvelt, K. M. (2021). Bidirectional contact tracing could dramatically improve covid-19 control. Nature Communications, 12 (1), 1–9.

Braithwaite, I., Callender, T., Bullock, M., & Aldridge, R. W. (2020). Automated and partly automated contact tracing: A systematic review to inform the control of covid-19. The Lancet Digital Health , 2 (11).

Capasso, A., Kim, S., Ali, S. H., Jones, A. M., DiClemente, R. J., & Tozan, Y. (2022). Employment conditions as barriers to the adoption of covid-19 mitigation measures: How the covid-19 pandemic may be deepening health disparities among low-income earners and essential workers in the united states. BMC Public Health, 22 (1), 1–13.

Castex, G., Dechter, E., & Lorca, M. (2020). Covid-19: The impact of social distancing policies, cross-country analysis. Economics of Disasters and Climate Change , 1–25.

Castro, M. C., de Carvalho, L. R., Chin, T., Kahn, R., Franca, G. V., Macario, E. M., & de Oliveira, W. K. (2020). Demand for hospitalization services for covid-19 patients in Brazil. MedRxiv .

Chakraborty, K., Bhatia, S., Bhattacharyya, S., Platos, J., Bag, R., & Hassanien, A. E. (2020). Sentiment analysis of covid-19 tweets by deep learning classifiers-a study to show how popularity is affecting accuracy in social media. Applied Soft Computing, 97 , 106754.

Chan, E. Y., & Saqib, N. U. (2021). Privacy concerns can explain unwillingness to download and use contact tracing apps when covid-19 concerns are high. Computers in Human Behavior, 119 , 106718.

Chao, H., Fang, X., Zhang, J., Homayounieh, F., Arru, C. D., Digumarthy, S. R., Babaei, R., Mobin, H. K., Mohseni, I., Saba, L., et al. (2021). Integrative analysis for covid-19 patient outcome prediction. Medical Image Analysis, 67 , 101844.

Chen, T., Rong, J., Peng, L., Yang, J., Cong, G., Fang, J. (2021). Analysis of social effects on employment promotion policies for college graduates based on data mining for online use review in china during the covid-19 pandemic. In: Healthcare , Multidisciplinary Digital Publishing Institute, 9 , p. 846.

Chen, E., Lerman, K., & Ferrara, E. (2020). Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health and Surveillance, 6 (2), e19273.

Chernozhukov, V., Kasahara, H., & Schrimpf, P. (2021). Causal impact of masks, policies, behavior on early covid-19 pandemic in the US. Journal of Econometrics, 220 (1), 23–62.

Chieregato, M., Frangiamore, F., Morassi, M., Baresi, C., Nici, S., Bassetti, C., Bnà, C., & Galelli, M. (2022). A hybrid machine learning/deep learning covid-19 severity predictive model from ct images and clinical data. Scientific Reports, 12 (1), 1–15.

Chiroma, H., Ezugwu, A. E., Jauro, F., Al-Garadi, M. A., Abdullahi, I. N., & Shuib, L. (2020). Early survey with bibliometric analysis on machine learning approaches in controlling covid-19 outbreaks. PeerJ Computer Science, 6 , e313.

Cho, H., Ippolito, D., & Yu, Y. W. (2020). Contact tracing mobile apps for covid-19: Privacy considerations and related trade-offs. arXiv:2003.11511

Chowdhury, N. K., Rahman, M. M., & Kabir, M. A. (2020). Pdcovidnet: A parallel-dilated convolutional neural network architecture for detecting covid-19 from chest x-ray images. Health Information Science and Systems, 8 (1), 1–14.

Cinelli, M., Quattrociocchi, W., Galeazzi, A., Valensise, C. M., Brugnoli, E., Schmidt, A. L., Zola, P., Zollo, F., & Scala, A. (2020). The covid-19 social media infodemic. Scientific Reports, 10 (1), 1–10.

Colizza, V., Grill, E., Mikolajczyk, R., Cattuto, C., Kucharski, A., Riley, S., Kendall, M., Lythgoe, K., Bonsall, D., Wymant, C., et al. (2021). Time to evaluate covid-19 contact-tracing apps. Nature Medicine, 27 (3), 361–362.

Connor, C., De Valliere, N., Warwick, J., Stewart-Brown, S., & Thompson, A. (2022). The cov-ed survey: Exploring the impact of learning and teaching from home on parent/carers’ and teachers’ mental health and wellbeing during covid-19 lockdown. BMC Public Health, 22 (1), 1–15.

Cortés-Martínez, K. V., Estrada-Esquivel, H., Martínez-Rebollar, A., Hernández-Pérez, Y., & Ortiz-Hernández, J. (2022). The state of the art of data mining algorithms for predicting the covid-19 pandemic. Axioms, 11 (5), 242.

IHME COVID-19 Forecasting Team, Reiner, R., Barber, R., & Collins, J. (2020). Modeling covid-19 scenarios for the United States. Nature Medicine .

Cuan-Baltazar, J. Y., Muñoz-Perez, M. J., Robledo-Vega, C., Pérez-Zepeda, M. F., & Soto-Vega, E. (2020). Misinformation of covid-19 on the internet: Infodemiology study. JMIR Public Health and Surveillance, 6 (2), e18444.

Cuello-Garcia, C., Pérez-Gaxiola, G., & van Amelsvoort, L. (2020). Social media can have an impact on how we manage and investigate the covid-19 pandemic. Journal of Clinical Epidemiology, 127 , 198–201.

Dar, A. B., Lone, A. H., Zahoor, S., Khan, A. A., & Naaz, R. (2020). Applicability of mobile contact tracing in fighting pandemic (covid-19): Issues, challenges and solutions. Computer Science Review, 38 , 100307. https://doi.org/10.1016/j.cosrev.2020.100307 , www.sciencedirect.com/science/article/pii/S157401372030407X

Dash, S., Chakraborty, C., Giri, S. K., & Pani, S. K. (2021). Intelligent computing on time-series data analysis and prediction of covid-19 pandemics. Pattern Recognition Letters, 151 , 69–75.

de Figueiredo, C. S., Sandre, P. C., Portugal, L. C. L., Mázala-de Oliveira, T., da Silva Chagas, L., Raony, Í., Ferreira, E. S., Giestal-de Araujo, E., Dos Santos, A. A., & Bomfim, P.O.-S. (2021). Covid-19 pandemic impact on children and adolescents’ mental health: Biological, environmental, and social factors. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 106 , 110171.

De Santis, E., Martino, A., & Rizzi, A. (2020). An infoveillance system for detecting and tracking relevant topics from Italian tweets during the covid-19 event. IEEE Access, 8 , 132527–132538.

Desai, P. S. (2021). News sentiment informed time-series analyzing ai (sitala) to curb the spread of covid-19 in Houston. Expert Systems with Applications, 180 , 115104. https://doi.org/10.1016/j.eswa.2021.115104 , www.sciencedirect.com/science/article/pii/S0957417421005455

Devi, V. A., & Nayyar, A. (2021). Evaluation of geotagging twitter data using sentiment analysis during covid-19. In: Proceedings of the second international conference on information management and machine intelligence , Springer, pp. 601–608.

Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., & Dietze, S. (2020). Tweetscov19-a knowledge base of semantically annotated tweets about the covid-19 pandemic. In: Proceedings of the 29th ACM international conference on information & knowledge management , pp. 2991–2998.

Durowaye, T. D., Rice, A. R., Konkle, A., & Phillips, K. P. (2022). Public health perinatal promotion during covid-19 pandemic: A social media analysis. BMC Public Health, 22 (1), 1–12.

Elnagar, A., Al-Debsi, R., & Einea, O. (2020). Arabic text classification using deep learning models. Information Processing & Management, 57 (1), 102121.

Elsheikh, A. H., Saba, A. I., Abd Elaziz, M., Lu, S., Shanmugan, S., Muthuramalingam, T., Kumar, R., Mosleh, A. O., Essa, F., & Shehabeldeen, T. A. (2021). Deep learning-based forecasting model for covid-19 outbreak in Saudi Arabia. Process Safety and Environmental Protection, 149 , 223–233.

Ferguson, N. M., Laydon, D., Nedjati-Gilani, G., Imai, N., Ainslie, K., Baguelin, M., Bhatia, S., Boonyasiri, A., Cucunubá, Z., Cuomo-Dannenburg, G., et al. (2020). Impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand. Imperial College COVID-19 Response Team , 20 .

Gao, J., Zheng, P., Jia, Y., Chen, H., Mao, Y., Chen, S., Wang, Y., Fu, H., & Dai, J. (2020). Mental health problems and social media exposure during covid-19 outbreak. Plos one, 15 (4), e0231924.

Gencoglu, O. (2020). Large-scale, language-agnostic discourse classification of tweets during covid-19. Machine Learning and Knowledge Extraction, 2 (4), 603–616.

Ghosh, S., & Das, L. C. (2022). Using data mining techniques for covid-19: A systematic. Science and Technology, 8 (2), 36–42.

Giordano, G., Blanchini, F., Bruno, R., Colaneri, P., Di Filippo, A., Di Matteo, A., & Colaneri, M. (2020). Modelling the covid-19 epidemic and implementation of population-wide interventions in Italy. Nature Medicine, 26 (6), 855–860.

Gozes, O., Frid-Adar, M., Greenspan, H., Browning, P. D., Zhang, H., Ji, W., Bernheim, A., & Siegel, E. (2020). Rapid ai development cycle for the coronavirus (covid-19) pandemic: Initial results for automated detection & patient monitoring using deep learning ct image analysis. arXiv:2003.05037

Grasselli, G., Pesenti, A., & Cecconi, M. (2020). Critical care utilization for the covid-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response. Jama, 323 (16), 1545–1546.

Guntuku, S. C., Sherman, G., Stokes, D. C., Agarwal, A. K., Seltzer, E., Merchant, R. M., & Ungar, L. H. (2020). Tracking mental health and symptom mentions on twitter during covid-19. Journal of General Internal Medicine, 35 (9), 2798–2800.

Gupta, R., Ibraheim, M. K., & Doan, H. Q. (2020). Teledermatology in the wake of covid-19: Advantages and challenges to continued care in a time of disarray. Journal of the American Academy of Dermatology, 83 (1), 168–169.

Hamzah, F. B., Lau, C., Nazri, H., Ligot, D., Lee, G., Tan, C., Shaib, M., Zaidon, U., Abdullah, A., Chung, M., et al. (2020). Coronatracker: Worldwide covid-19 outbreak data analysis and prediction. Bull World Health Organ , 1 (32).

Haouari, F., Hasanain, M., Suwaileh, R., & Elsayed, T. (2021). ArCOV-19: The first Arabic COVID-19 Twitter dataset with propagation networks. In: Proceedings of the sixth arabic natural language processing workshop, association for computational linguistics , pp. 82–91.

Heikal, M., Torki, M., & El-Makky, N. (2018). Sentiment analysis of arabic tweets using deep learning. Procedia Computer Science, 142 , 114–122.

Hernandez-Matamoros, A., Fujita, H., Hayashi, T., & Perez-Meana, H. (2020). Forecasting of covid19 per regions using arima models and polynomial functions. Applied Soft Computing, 96 , 106610–106610.

Ho, K. K., Chiu, D. K., & Sayama, K. C. (2023). When privacy, distrust, and misinformation cause worry about using covid-19 contact-tracing apps. IEEE Internet Computing, 01 , 1–7.

Hossain, M., Junus, A., Zhu, X., Jia, P., Wen, T.-H., Pfeiffer, D., & Yuan, H.-Y. (2020). The effects of border control and quarantine measures on global spread of covid-19 (3/6/2020).

Hou, K., Hou, T., & Cai, L. (2021). Public attention about covid-19 on social media: An investigation based on data mining and text analysis. Personality and Individual Differences, 175 , 110701.

Hussain, A., & Sheikh, A. (2021). Opportunities for artificial intelligence–enabled social media analysis of public attitudes toward covid-19 vaccines. NEJM Catalyst Innovations in Care Delivery , 2 (1).

Ibrahim, H. S., Abdou, S. M., & Gheith, M. (2015). Sentiment analysis for modern standard arabic and colloquial. arXiv:1505

Ilyas, M., Rehman, H., & Naït-Ali, A. (2020). Detection of covid-19 from chest x-ray images using artificial intelligence: An early review. arXiv:2004.05436

Iwendi, C., Mohan, S., Ibeke, E., Ahmadian, A., Ciano, T., et al. (2022). Covid-19 fake news sentiment analysis. Computers and Electrical Engineering, 101 , 107967.

Jain, R., Gupta, M., Taneja, S., & Hemanth, D. J. (2020). Deep learning based detection and analysis of covid-19 on chest x-ray images. Applied Intelligence , 1–11.

Jamieson, J., Yamashita, N., Epstein, D. A., & Chen, Y. (2021). Deciding if and how to use a covid-19 contact tracing app: Influences of social factors on individual use in Japan. Proceedings of the ACM on Human-Computer Interaction, 5 (CSCW2), 1–30.

Janarthanan, S., Rajendran, M., Biju, T. S., Ravi, N., Sundaramoorthy, K., & Nandan Mohanty, S. (2021). Artificial intelligence (ai) combined with medical imaging enables rapid diagnosis for covid-19. In: Applications of artificial intelligence in COVID-19 , Springer, pp. 55–72.

Kabir, M. Y., & Madria, S. (2021). Emocov: Machine learning for emotion detection, analysis and visualization using covid-19 tweets. Online Social Networks and Media, 23 , 100135. https://doi.org/10.1016/j.osnem.2021.100135 , https://www.sciencedirect.com/science/article/pii/S2468696421000197

Kang, E., Lee, S. Y., Jung, H., Kim, M. S., Cho, B., & Kim, Y. S. (2020). Operating protocols of a community treatment center for isolation of patients with coronavirus disease, South Korea. Emerging Infectious Diseases, 26 (10), 2329.

Katris, C. (2021). A time series-based statistical approach for outbreak spread forecasting: Application of covid-19 in Greece. Expert Systems with Applications, 166 , 114077.

Kiamari, M., Ramachandran, G., Nguyen, Q., Pereira, E., Holm, J., & Krishnamachari, B. (2020). Covid-19 risk estimation using a time-varying sir-model. In: Proceedings of the 1st ACM SIGSPATIAL international workshop on modeling and understanding the spread of COVID-19 , pp. 36–42.

Kim, K.-M., & Rhee, H.-S. (2022). Influential factors for covid-19 related distancing in daily life: A distinct focus on ego-gram. BMC Public Health, 22 (1), 1–13.

Koh, J. X., & Liew, T. M. (2020). How loneliness is talked about in social media during covid-19 pandemic: Text mining of 4,492 twitter feeds. Journal of Psychiatric Research . https://doi.org/10.1016/j.jpsychires.2020.11.015 , www.sciencedirect.com/science/article/pii/S0022395620310748

Kucharski, A. J., Russell, T. W., Diamond, C., Liu, Y., Edmunds, J., Funk, S., Eggo, R. M., Sun, F., Jit, M., Munday, J. D., et al. (2020). Early dynamics of transmission and control of covid-19: A mathematical modelling study. The Lancet Infectious Diseases, 20 (5), 553–558.

Kuo, C.-P., & Fu, J. S. (2021). Evaluating the impact of mobility on covid-19 pandemic with machine learning hybrid predictions. Science of The Total Environment, 758 , 144151.

Lai, S., Bogoch, I. I., Ruktanonchai, N. W., Watts, A., Lu, X., Yang, W., Yu, H., Khan, K., & Tatem, A. J. (2020). Assessing spread risk of wuhan novel coronavirus within and beyond China, January-April: A travel network-based modelling study. MedRxiv .

Lamsal, R. (2020). Coronavirus (covid-19) geo-tagged tweets dataset. https://doi.org/10.21227/fpsb-jz61

Lamsal, R. (2020). Coronavirus (covid-19) tweets dataset. https://doi.org/10.21227/781w-ef42

Lamsal, R. (2020). Design and analysis of a large-scale covid-19 tweets dataset. Applied Intelligence , 1–15.

Lazarus, J. V., Ratzan, S. C., Palayew, A., Gostin, L. O., Larson, H. J., Rabin, K., Kimball, S., & El-Mohandes, A. (2021). A global survey of potential acceptance of a covid-19 vaccine. Nature Medicine, 27 (2), 225–228.

Lee, H. S. (2020). Exploring the initial impact of covid-19 sentiment on us stock market using big data. Sustainability, 12 (16), 6648.

Leung, C. K., Kaufmann, T. N., Wen, Y., Zhao, C., & Zheng, H. (2022). Revealing covid-19 data by data mining and visualization. In: Advances in Intelligent Networking and Collaborative Systems: The 13th International Conference on Intelligent Networking and Collaborative Systems (INCoS-2021), Springer, 13, pp. 70–83.

Leung, K., Wu, J. T., Liu, D., & Leung, G. M. (2020). First-wave covid-19 transmissibility and severity in China outside hubei after control measures, and second-wave scenario planning: A modelling impact assessment. The Lancet, 395 (10233), 1382–1393.

Li, L., Yang, Z., Dang, Z., Meng, C., Huang, J., Meng, H., Wang, D., Chen, G., Zhang, J., Peng, H., et al. (2020). Propagation analysis and prediction of the covid-19. Infectious Disease Modelling, 5 , 282–292.

Li, C., Chen, L. J., Chen, X., Zhang, M., Pang, C. P., & Chen, H. (2020). Retrospective analysis of the possibility of predicting the covid-19 outbreak from internet searches and social media data, China, 2020. Eurosurveillance, 25 (10), 2000199.

Liang, W., Fan, Y., Li, K.-C., Zhang, D., & Gaudiot, J.-L. (2020). Secure data storage and recovery in industrial blockchain network environments. IEEE Transactions on Industrial Informatics, 16 (10), 6543–6552.

Lin, L., & Hou, Z. (2020). Combat covid-19 with artificial intelligence and big data. Journal of Travel Medicine , 27 (5), taaa080.

Liu, P., Beeler, P., & Chakrabarty, R. K. (2020). Covid-19 progression timeline and effectiveness of response-to-spread interventions across the united states, medRxiv .

Liu, M., Zhang, Z., Chai, W., & Wang, B. (2023). Privacy-preserving covid-19 contact tracing solution based on blockchain. Computer Standards & Interfaces, 83 , 103643.

López, V., & Čukić, M. (2021). A dynamical model of sars-cov-2 based on people flow networks. Safety Science, 134 , 105034.

Lucivero, F., Hallowell, N., Johnson, S., Prainsack, B., Samuel, G., & Sharon, T. (2020). Covid-19 and contact tracing apps: Ethical challenges for a social experiment on a global scale. Journal of Bioethical Inquiry, 17 (4), 835–839.

Luo, Y., Li, W., Zhao, T., Yu, X., Zhang, L., Li, G., & Tang, N. (2020). Deeptrack: Monitoring and exploring spatio-temporal data: A case of tracking covid-19. Proceedings of the VLDB Endowment, 13 (12), 2841–2844.

Luz, E., Silva, P., Silva, R., Silva, L., Guimarães, J., Miozzo, G., Moreira, G., & Menotti, D. (2021). Towards an effective and efficient deep learning model for covid-19 patterns detection in x-ray images. Research on Biomedical Engineering , 1–14.

Mahalle, P., Kalamkar, A. B., Dey, N., Chaki, J., Shinde, G. R., et al. (2020). Forecasting models for coronavirus (covid-19): A survey of the state-of-the-art.

Mahmud, T., Rahman, M. A., & Fattah, S. A. (2020). Covxnet: A multi-dilation convolutional neural network for automatic covid-19 and other pneumonia detection from chest x-ray images with transferable multi-receptive feature optimization. Computers in Biology and Medicine, 122 , 103869.

Mavragani, A. (2020). Tracking covid-19 in europe: Infodemiology approach. JMIR Public Health and Surveillance, 6 (2), e18941.

Mbunge, E. (2020). Integrating emerging technologies into covid-19 contact tracing: Opportunities, challenges and pitfalls. Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 14 (6), 1631–1636.

Minaee, S., Kafieh, R., Sonka, M., Yazdani, S., & Soufi, G. J. (2020). Deep-covid: Predicting covid-19 from chest x-ray images using deep transfer learning. Medical Image Analysis, 65 , 101794.

Moghadas, S. M. Shoukat, A. Fitzpatrick, M. C., Wells, C. R., Sah, P., Pandey, A., Sachs, J. D., Wang, Z., Meyers, L. A., Singer, B. H, (2020) et al. Projecting hospital utilization during the covid-19 outbreaks in the United States. Proceedings of the National Academy of Sciences , 117 (16) 9122–9126.

Mokbel, M., Abbar, S., & Stanojevic, R. (2020). Contact tracing: Beyond the apps. SIGSPATIAL Special, 12 (2), 15–24.

Mourad, A., & Darwish, K. (2013). Subjectivity and sentiment analysis of modern standard arabic and arabic microblogs. In: Proceedings of the 4th workshop on computational approaches to subjectivity, sentiment and social media analysis , pp. 55–64.

Murphy, R., Calugi, S., Cooper, Z., & Dalle Grave, R. (2020). Challenges and opportunities for enhanced cognitive behaviour therapy (cbt-e) in light of covid-19. The Cognitive Behaviour Therapist , 13 .

Mushtaq, M. F., Fareed, M. M. S., Almutairi, M., Ullah, S., Ahmed, G., & Munir, K. (2022). Analyses of public attention and sentiments towards different covid-19 vaccines using data mining techniques. Vaccines, 10 (5), 661.

Mutlu, E. C., Oghaz, T., Jasser, J., Tutunculer, E., Rajabi, A., Tayebi, A., Ozmen, O., & Garibay, I. (2020). A stance data set on polarized conversations on twitter about the efficacy of hydroxychloroquine as a treatment for covid-19. Data in brief, 33 , 106401.

Mutlu, E., Oghaz, T., Jasser, J., Tutunculer, E., Rajabi, A., Tayebi, A., Ozmen, O., & Garibay, I. (2020). A stance data set on polarized conversations on twitter about the efficacy of hydroxychloroquine as a treatment for covid-19. Data in Brief, 33 , 106401–106401.

Nadim, S. S., Ghosh, I., & Chattopadhyay, J. (2021). Short-term predictions and prevention strategies for covid-19: a model-based study. Applied Mathematics and Computation, 404 , 126251.

Nakov, P., & Da San Martino, G. (2021). Fake news, disinformation, propaganda, media bias, and flattening the curve of the covid-19 infodemic. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining , pp. 4054–4055.

Namasudra, S., Dhamodharavadhani, S., & Rathipriya, R. (2023). Nonlinear neural network based forecasting model for predicting covid-19 cases. Neural Processing Letters , 1–21.

Naseem, U., Razzak, I., Khushi, M., Eklund, P. W., & Kim, J. (2021). Covidsenti: A large-scale benchmark twitter data set for covid-19 sentiment analysis. IEEE Transactions on Computational Social Systems .

Nemes, L., & Kiss, A. (2021). Social media sentiment analysis based on covid-19. Journal of Information and Telecommunication, 5 (1), 1–15.

Oehmke, T. B., Post, L. A., Moss, C. B., Issa, T. Z., Boctor, M. J., Welch, S. B., & Oehmke, J. F. (2021). Dynamic panel data modeling and surveillance of covid-19 in metropolitan areas in the united states: Longitudinal trend analysis. Journal of Medical Internet Research, 23 (2), e26081.

Oliveira, J. F., Jorge, D. C., Veiga, R. V., Rodrigues, M. S., Torquato, M. F., da Silva, N. B., Fiaccone, R. L., Cardim, L. L., Pereira, F. A., de Castro, C. P. et al. (2021). Mathematical modeling of covid-19 in 14.8 million individuals in Bahia, Brazil. Nature Communications 12 (1), 1–13.

Ordun, C., Purushotham, S., & Raff, E. (2020). Exploratory analysis of covid-19 tweets using topic modeling, umap, and digraphs. arXiv:2005.03082

Organization, W. H., et al. (2021). Looking back at a year that changed the world: Who’s response to covid-19, 22 January 2021. Tech. rep.: World Health Organization .

Ouchicha, C., Ammor, O., & Meknassi, M. (2020). Cvdnet: A novel deep learning architecture for detection of coronavirus (covid-19) from chest x-ray images. Chaos, Solitons & Fractals, 140 , 110245–110245.

Padhan, R., & Prabheesh, K. (2021). The economics of covid-19 pandemic: A survey. Economic Analysis and Policy, 70 , 220–237.

Park, Y. J., Choe, Y. J., Park, O., Park, Kim, S.Y., Kim, J., Kweon, S., Woo, Y., Gwack, J., Kim, S. S., et al. (2020). 1440 Contact tracing during coronavirus disease outbreak, South Korea, 2020. Emerging Infectious Diseases, 26 (10), 2465–2468.

Park, J. Y., Mistur, E., Kim, D., Mo, Y., Hoefer, R. (2021). Toward human-centric urban infrastructure: Text mining for social media data to identify the public perception of covid-19 policy in transportation hubs. Sustainable Cities and Society , 103524.

Park, Y.-E. (2022). Developing a covid-19 crisis management strategy using news media and social media in big data analytics. Social Science Computer Review, 40 (6), 1358–1375.

Perumal, V., Narayanan, V., & Rajasekar, S. J. S. (2020). Detection of covid-19 using cxr and ct images using transfer learning and haralick features. Applied Intelligence , 1–18.

Pham, D. P. T., Quang, A. H. N., & Duong, D. (2022). The impact of us presidents on market returns: Evidence from trump’s tweets. Research in International Business and Finance , 101681.

Pirkis, J., John, A., Shin, S., DelPozo-Banos, M., Arya, V., Analuisa-Aguilar, P., Appleby, L., Arensman, E., Bantjes, J., Baran, A., et al. (2021). Suicide trends in the early months of the covid-19 pandemic: An interrupted time-series analysis of preliminary data from 21 countries. The Lancet Psychiatry, 8 (7), 579–588.

Qazi, U., Imran, M., & Ofli, F. (2020). Geocov19: A dataset of hundreds of millions of multilingual covid-19 tweets with location information. SIGSPATIAL Special, 12 (1), 6–15.

Quak, E., Girault, G., Thenint, M. A., Weyts, K., Lequesne, J., & Lasnon, C. (2021). Author gender inequality in medical imaging journals and the covid-19 pandemic. Radiology 204417.

Rehouma, R., Buchert, M., & Chen, Y.-P. P. (2021). Machine learning for medical imaging-based covid-19 detection and diagnosis. International Journal of Intelligent Systems , 5085–5115.

Rocha Filho, T. M., dos Santos, F. S. G., Gomes, V. B., Rocha, T. A., Croda, J. H., Ramalho, W. M., Araujo, W. N. (2020). Expected impact of covid-19 outbreak in a major metropolitan area in Brazil. MedRxiv .

Rovetta, A., & Bhagavathula, A. S. (2020). Covid-19-related web search behaviors and infodemic attitudes in italy: Infodemiological study. JMIR Public Health and Surveillance, 6 (2), e19374.

Russo, L., Anastassopoulou, C., Tsakris, A., Bifulco, G., Campana, E., Toraldo, G., Siettos, C., (2020). T. DAY-ZERO, forecasting the fade out of the covid-19 outbreak in lombardy, Italy: A compartmental modelling and numerical optimization approach. MedRxiv .

Sadler, T. D., Friedrichsen, P., Zangori, L., & Ke, L. (2020). Technology-supported professional development for collaborative design of covid-19 instructional materials. Journal of Technology and Teacher Education, 28 (2), 171–177.

Safdari, R., Rezayi, S., Saeedi, S., Tanhapour, M., & Gholamzadeh, M. (2021). Using data mining techniques to fight and control epidemics: A scoping review. Health and Technology, 11 (4), 759–771.

Samuel, J., Ali, G., Rahman, M., Esawi, E., Samuel, Y., et al. (2020). Covid-19 public sentiment insights and machine learning for tweets classification. Information, 11 (6), 314.

Schultz, M. J., Sivakorn, C., & Dondorp, A. M. (2020). Challenges and opportunities for lung ultrasound in novel coronavirus disease (covid-19). The American Journal of Tropical Medicine and Hygiene, 102 (6), 1162.

Shaar, S., Alam, F., Da San Martino, G., Nikolov, A., Zaghouani, W., Nakov, P., Feldman, A. (2021). Findings of the nlp4if-2021 shared tasks on fighting the covid-19 infodemic and censorship detection. In: Proceedings of the fourth workshop on NLP for internet freedom: Censorship, Disinformation, and Propaganda , pp. 82–92.

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22 , 100104.

Shahid, F., Zameer, A., & Muneeb, M. (2020). Predictions for covid-19 with deep learning models of lstm, gru and bi-lstm. Chaos, Solitons & Fractals , 140 (C), 110212.

Shakibaei, S., De Jong, G. C., Alpkökin, P., & Rashidi, T. H. (2021). Impact of the covid-19 pandemic on travel behavior in istanbul: A panel data analysis. Sustainable Cities and Society, 65 , 102619.

Sharma, K., Seo, S., Meng, C., Rambhatla, S., & Liu, Y. (2020). Covid-19 on social media: Analyzing misinformation in twitter conversations. arXiv:2003

Sharma, K., Qian, F., Jiang, H., Ruchansky, N., Zhang, M., & Liu, Y. (2019). Combating fake news: A survey on identification and mitigation techniques. ACM Transactions on Intelligent Systems and Technology (TIST), 10 (3), 1–42.

Shinde, G. R., Kalamkar, A. B., Mahalle, P. N., Dey, N., Chaki, J., & Hassanien, A. E. (2020). Forecasting models for coronavirus disease (covid-19): A survey of the state-of-the-art. SN Computer Science, 1 (4), 1–15.

Silva, R., Barreira, B., Xavier, F., Saraiva, A., & Cugnasca, C. (2020). Use of econometrics and machine learning models to predict the number of new cases per day of covid-19. In: Anais do XX Simpósio Brasileiro de Computação Aplicada à Saúde , SBC, pp. 332–343.

Singh, R. K., Pandey, R., Babu, R. N. (2020). Covidscreen: Explainable deep learning framework for differential diagnosis of covid-19 using chest x-rays. Neural Computing and Applications , 1–22.

Siwiak, M. M., Szczesny, P., & Siwiak, M. P. (2020). From a single host to global spread. the global mobility based modelling of the covid-19 pandemic implies higher infection and lower detection rates than current estimates. The Global Mobility Based Modelling of the COVID-19 Pandemic Implies Higher Infection and Lower Detection Rates than Current Estimates (3/23/2020) .

Soomro, T. A., Zheng L., Afifi, A. J., Ali, A., Yin, M., & Gao, J. (2022). Artificial intelligence (ai) for medical imaging to combat coronavirus disease (covid-19): A detailed review with direction for future research. Artificial Intelligence Review , 1–31.

Sun, X., Andoh, E. A., & Yu, H. (2021). A simulation-based analysis for effective distribution of covid-19 vaccines: A case study in Norway. Transportation Research Interdisciplinary Perspectives, 11 , 100453.

Tabik, S., Gómez-Ríos, A., Martín-Rodríguez, J. L., Sevillano-García, I., Rey-Area, M., Charte, D., Guirado, E., Suárez, J. L., Luengo, J., Valero-González, M., et al. (2020). Covidgr dataset and covid-sdnet methodology for predicting covid-19 based on chest x-ray images. IEEE Journal of Biomedical and Health Informatics, 24 (12), 3595–3605.

Tamal, M., Alshammari, M., Alabdullah, M., Hourani, R., Alola, H. A., & Hegazi, T. M. (2021). An integrated framework with machine learning and radiomics for accurate and rapid early diagnosis of covid-19 from chest x-ray. Expert Systems with Applications, 180 , 115152. https://doi.org/10.1016/j.eswa.2021.115152 , www.sciencedirect.com/science/article/pii/S0957417421005935

Tan, C., & Lin, J. (2023). A new qoe-based prediction model for evaluating virtual education systems with covid-19 side effects using data mining. Soft Computing, 27 (3), 1699–1713.

Tang, Y., & Wang, S. (2020). Mathematic modeling of covid-19 in the United States. Emerging Microbes & Infections, 9 (1), 827–829.

Teng, S., Jiang, N., & Khong, K. W. (2022). Using big data to understand the online ecology of covid-19 vaccination hesitancy. Humanities and Social Sciences Communications, 9 (1), 1–15.

Torres, T. S., Hoagland, B., Bezerra, D. R., Garner, A., Jalil, E. M., Coelho, L. E., Benedetti, M., Pimenta, C., Grinsztejn, B., Veloso, V. G. (2020). Impact of covid-19 pandemic on sexual minority populations in Brazil: An analysis of social/racial disparities in maintaining social distancing and a description of sexual behavior. AIDS and Behavior , 1–12.

Traini, M. C., Caponi, C., & De Socio, G. V. (2020). Modelling the epidemic 2019-ncov event in italy: A preliminary note. MedRxiv .

Tran, C. D., & Nguyen, T. T. (2021). Health vs. privacy? the risk-risk tradeoff in using covid-19 contact-tracing apps. Technology in Society , 67 , 101755.

Turkoglu, M. (2020). Covidetectionet: Covid-19 diagnosis system based on x-ray images using features selected from pre-learned deep features ensemble. Applied Intelligence , 1–14.

Ulhaq, A., Born, J., Khan, A., Gomes, D. P. S., Chakraborty, S., & Paul, M. (2020). Covid-19 control by computer vision approaches: A survey. IEEE Access, 8 , 179437–179456.

Umer, M., Ashraf, I., Ullah, S., Mehmood, A., & Choi, G. S. (2021). Covinet: A convolutional neural network approach for predicting covid-19 from chest x-ray images. Journal of Ambient Intelligence and Humanized Computing , 1–13.

Vafea, M. T., Atalla, E., Georgakas, J., Shehadeh, F., Mylona, E. K., Kalligeros, M., & Mylonakis, E. (2020). Emerging technologies for use in the study, diagnosis, and treatment of patients with covid-19. Cellular and Molecular Bioengineering, 13 (4), 249–257.

Vandeput, N. (2021). 2 forecast kpi. In: Data Science for Supply Chain Forecasting , De Gruyter, pp. 10–26.

Vecino-Ortiz, A. I., Villanueva Congote, J., Zapata Bedoya, S., & Cucunuba, Z. M. (2021). Impact of contact tracing on covid-19 mortality: An impact evaluation using surveillance data from Colombia. Plos one, 16 (3), e0246987.

Verbeek, H., Gerritsen, D. L., Backhaus, R., de Boer, B. S., Koopmans, R. T., & Hamers, J. P. (2020). Allowing visitors back in the nursing home during the covid-19 crisis: A dutch national study into first experiences and impact on well-being. Journal of the American Medical Directors Association, 21 (7), 900–904.

Wahid, M. A., Bukhari, S. H. R., Daud, A., Awan, S. E., & Raja, M. A. Z. (2023). Covict: An iot based architecture for covid-19 detection and contact tracing. Journal of Ambient Intelligence and Humanized Computing, 14 (6), 7381–7398.

Wang, H., Zhang, Y., Lu, S., & Wang, S. (2020). Tracking and forecasting milepost moments of the epidemic in the early-outbreak: Framework and applications to the covid-19, F1000Research 9 .

Wang, Q., Wang, X., & Lin, H. (2020). The role of triage in the prevention and control of covid-19. Infection Control & Hospital Epidemiology, 41 (7), 772–776.

Windsor, L., Benoit, E., Pinto, R. M., & Sarol, J. (2022). Optimization of a new adaptive intervention using the smart design to increase covid-19 testing among people at high risk in an urban community. Trials, 23 (1), 1–16.

Wu, J., Wang, K., He, C., Huang, X., & Dong, K. (2021). Characterizing the patterns of China’s policies against covid-19: A bibliometric study. Information Processing & Management, 58 (4), https://doi.org/10.1016/j.ipm.2021.102562 , www.sciencedirect.com/science/article/pii/S0306457321000650

Yao, Z., Tang, P., Fan, J., & Luan, J. (2021). Influence of online social support on the public’s belief in overcoming covid-19. Information Processing & Management, 58 (4), 102583.

Yasaka, T. M., Lehrich, B. M., & Sahyouni, R. (2020). Peer-to-peer contact tracing: development of a privacy-preserving smartphone app. JMIR mHealth and uHealth, 8 (4), e18936.

Yih, W. K., Daley, M. F., Duffy, J., Fireman, B., McClure, D., Nelson, J., Qian, L., Smith, N., Vazquez-Benitez, G., Weintraub, E., et al. (2023). A broad assessment of covid-19 vaccine safety using tree-based data-mining in the vaccine safety datalink. Vaccine, 41 (3), 826–835.

Zebin, T., & Rezvy, S. (2020). Covid-19 detection and disease progression visualization: Deep learning on chest x-rays for classification and coarse localization. Applied Intelligence , 1–12.

Zeemering, E. S. (2021). Functional fragmentation in city hall and twitter communication during the covid-19 pandemic: Evidence from Atlanta, San Francisco, and Washington, DC. Government Information Quarterly, 38 (1), 101539.

Zeroual, A., Harrou, F., Dairi, A., & Sun, Y. (2020). Deep learning methods for forecasting covid-19 time-series data: A comparative study. Chaos, Solitons, and Fractals, 140 , 110121–110121.

Zhang, C., Xu, S., Li, Z., & Hu, S. (2021). Understanding concerns, sentiments, and disparities among population groups during the covid-19 pandemic via twitter data mining: Large-scale cross-sectional study. Journal of Medical Internet Research, 23 (3), e26482.

Zhao, Y., Cheng, S., Yu, X., & Xu, H.(2020). Chinese public’s attention to the covid-19 epidemic on social media: Observational descriptive study. Journal of Medical Internet Research , 22 (5), e18825.

Zheng, H., Goh, D.H.-L., Lee, C. S., Lee, E. W., & Theng, Y. L. (2020). Uncovering temporal differences in covid-19 tweets. Proceedings of the Association for Information Science and Technology, 57 (1), e233.

Zhong, B., Huang, Y., & Liu, Q. (2021). Mental health toll from the coronavirus: Social media usage reveals wuhan residents’ depression and secondary trauma in the covid-19 outbreak. Computers in Human Behavior, 114 , 106524.

Zhou, C., Su, F., Pei, T., Zhang, A., Du, Y., Luo, B., Cao, Z., Wang, J., Yuan, W., Zhu, Y., et al. (2020). Covid-19: Challenges to gis with big data. Geography and Sustainability, 1 (1), 77–87.

Zhu, X., Zhang, A., Xu, S., Jia, P., Tan, X., Tian, J., Wei, T., Quan, Z., & Yu, J. (2020). Spatially explicit modeling of 2019-ncov epidemic trend based on mobile phone data in mainland China MedRxiv .

Zivkovic, M., Bacanin, N., Venkatachalam, K., Nayyar, A., Djordjevic, A., Strumberger, I., & Al-Turjman, F. (2021). Covid-19 cases prediction by using hybrid machine learning and beetle antennae search approach. Sustainable Cities and Society, 66 , 102669.

Download references

Not Applicable.

Author information

Authors and affiliations.

Department of Computer Science, University of Sharjah, Sharjah, UAE

Imad Afyouni, Ibrahim Hashim & Zaher Aghbari

Department of Computer Science, UAE University, Al Ain, UAE

Tarek Elsaka

Tawuniya, Data Science Specialist Riyadh, Riyadh, Saudi Arabia

Mothanna Almahmoud

Computer Science Department, Al al-Bayt University, Mafraq, 25113, Jordan

Laith Abualigah

MEU Research Unit, Middle East University, Amman, 11831, Jordan

Applied Science Research Center, Applied Science Private University, Amman, 11931, Jordan

Jadara Research Center, Jadara University, Irbid, 21110, Jordan


All authors contribute equally.

Corresponding author

Correspondence to Laith Abualigah .

Ethics declarations

Conflict of interest.

The authors declare that there is no conflict of interest regarding the publication of this paper.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Afyouni, I., Hashim, I., Aghbari, Z. et al. Insights from the COVID-19 Pandemic: A Survey of Data Mining and Beyond. Appl. Spatial Analysis (2024). https://doi.org/10.1007/s12061-024-09588-5

Received : 30 January 2024

Accepted : 01 June 2024

Published : 22 June 2024

DOI : https://doi.org/10.1007/s12061-024-09588-5

  • Social data mining
  • Forecasting
  • Medical imaging
  • Contact tracing
  • Time series analysis

Data mining and analysis of scientific research data records on Covid-19 mortality, immunity, and vaccine development - In the first wave of the Covid-19 pandemic

Associated data.

All data and materials included in the article.

Background and aims

Covid-19 is a global pandemic that requires a global and integrated response of all national medical and healthcare systems. Covid-19 exposed the need for timely response and data sharing on fast spreading global pandemics. In this study, we investigate the scientific research response from the early stages of the pandemic, and we review key findings on how the early warning systems developed in previous epidemics responded to contain the virus.

Methods

We conducted data mining of scientific literature records from the Web of Science Core Collection, using the topics Covid-19, mortality, immunity, and vaccine. The records for each topic are analysed in isolation, and the analysis is compared with records on all Covid-19 research topics combined. The data records are analysed with computable statistical methods, including the Bibliometrix package in R Studio and the Web of Science data mining tool.

Results

From historical analysis of scientific data records on viruses, pandemics and mortality, we identified that Chinese universities have not historically led research on these topics. However, during the early stages of the Covid-19 pandemic, Chinese universities strongly dominated the research on these topics. Despite the current political and trade disputes, we found strong collaboration in Covid-19 research between the US and China. In the analysis on Covid-19 and immunity, we sought to identify the relationships between the different risk factors discussed in the news media, and we identified a few clusters containing references to exercise, inflammation, smoking, obesity and many additional factors. In the analysis on Covid-19 and vaccine, we found that although the USA leads in volume of scientific research on a Covid-19 vaccine, the three leading research institutions (Fudan, Melbourne, Oxford) are not based in the USA. Hence, it is difficult to predict which country will be first to produce a Covid-19 vaccine.


Conclusion

We analysed the conceptual structure maps with factorial analysis and multiple correspondence analysis (MCA), and identified multiple relationships between keywords, synonyms and concepts related to Covid-19 mortality, immunity, and vaccine development. We present integrated and correlated knowledge from 276 records on Covid-19 and mortality, 71 records on Covid-19 and immunity, and 189 records on Covid-19 vaccine.

  • We analysed scientific research records on Covid-19 with computable statistical methods.
  • We present visualisations of interrelationships between scientific research data records on Covid-19.
  • We compare the coverage of research on Covid-19 mortality, immunity, and vaccine development by organisations and countries.
  • Despite political disagreements, countries (US and China) are still collaborating in Covid-19 research.

1. Introduction

Since the emergence of Covid-19, the scientific literature on this pandemic has grown to a level where the information can be overwhelming, while a cure is still nowhere in sight. In this study, we investigated three topics in relation to Covid-19: mortality, immunity, and vaccine development. There is an argument that Covid-19 immunity can be established either by surviving the infection or through vaccine immunisation, and that both routes can apparently establish herd immunity within a population [ 1 ]. However, we do not know for certain whether herd immunity can be achieved with Covid-19. There is a suggestion that the immune response could be similar to that of SARS and MERS [ 2 ]. One study suggests that the BCG vaccine used against tuberculosis could reduce the effects of Covid-19 [ 3 ]. Another study argues that the BCG vaccine may not reduce Covid-19 mortality [ 4 ].

Given that the most severe inflammation in the lungs is directly caused by the immune response, some articles even argue that ‘good general health may not be advantageous’ when patients have advanced to the severe stage [ 5 ]. In addition, studies on antibody-dependent enhancement are conflicting, because antibodies ‘can enhance virus uptake by cells’ [ 6 ], but ‘may also inhibit virus entry’ [ 7 ]. One study argues that physical activity and exercise can improve the immune response to Covid-19, and ‘prevent viruses and other pathogens from gaining a foothold.’ [ 8 ]. One study even argues that obese patients are more contagious than lean patients, linking obesity to the spread of contagion [ 9 ]. The same study also argues that losing weight and moderate exercise can improve the immune response to Covid-19.

This is just the beginning of the rising volume of scientific records and information available on Covid-19. The apparent conflict in information is a natural part of the early stages of searching for a solution to a fast-spreading global pandemic. One thing is certain: the need to invent a vaccine that can be manufactured and distributed to immunise the entire global community against Covid-19 [ 10 ]. But even if such a vaccine were developed, it would still take between 3 and 10 months for commercialisation [ 11 ]. Alternatively, for herd immunity, we would need to survive a number of repeated waves of Covid-19, until 60–70% of the population has developed immunity [ 12 ]. These are difficult choices, and we can benefit from understanding all the scientific advice, which currently stands at 5208 research papers on the Web of Science Core Collection alone.

In this study, we present a computable statistical analysis of the current scientific research records, and we present conceptual maps that integrate and correlate the knowledge of current studies on Covid-19 mortality, immunity, and vaccine development.

2. Methodology

In this article, we used computable statistical methods for data mining and analysis of the Covid-19 scientific literature. The data mining and analysis focused on three topics: (1) Covid-19 mortality; (2) Covid-19 vaccine; (3) Covid-19 immunity. We performed three searches on the Web of Science Core Collection, one on each topic, and then analysed the data records in R Studio, using the ‘Bibliometrix’ package for the statistical analysis [ 13 ]. The Web of Science Core Collection contains over 21,100 peer-reviewed, high-quality scholarly journals published worldwide (including Open Access journals) in over 250 science, social science, and arts & humanities disciplines. For disclosure, we used only the Web of Science Core Collection database; we did not download data records from PubMed or Google Scholar. We hope that other researchers can do that and compare the results. Journals limit the number of figures, and we exceeded that limit in this article; analysing additional data sets would produce even more figures. Our rationale was that, since the Web of Science Core Collection contains most accredited journals, we did not see how the results would differ today. It could be more useful to run that analysis at a later stage of the pandemic and then compare the visualisations.
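As a rough illustration of the record-handling step, the sketch below reads a tagged plain-text export into Python dictionaries. It is not the authors' pipeline, which used the Bibliometrix package in R Studio. It assumes the common tagged layout in which each field starts with a two-letter tag (for example `TI` for title, `DE` for author keywords, `PY` for year), continuation lines are indented, and `ER` closes a record; the function name `parse_tagged_export` and the sample text are hypothetical.

```python
def parse_tagged_export(text: str) -> list[dict[str, str]]:
    """Parse tagged records: 'XX value' lines, indented continuations, 'ER' ends a record."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if line.strip() == "ER":
            # End of record: store it and start a fresh one.
            records.append(current)
            current, tag = {}, None
        elif line[:2].strip() and line[2:3] == " ":
            # New field: two-letter tag, one space, then the value.
            tag = line[:2]
            current[tag] = line[3:].strip()
        elif tag is not None:
            # Continuation line: append to the field opened by the last tag.
            current[tag] += " " + line.strip()
    return records


# Invented sample in the assumed layout (not real database output).
sample = """PT J
TI Example title about Covid-19
   and mortality
DE Covid-19; mortality
PY 2020
ER

PT J
TI Second example
DE vaccine
PY 2020
ER
EF"""

records = parse_tagged_export(sample)
for rec in records:
    print(rec["PY"], rec["TI"], rec["DE"].split("; "))
```

Keeping each record as a plain dictionary keyed by field tag makes it straightforward to run one search per topic and compare the resulting sets afterwards.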

3. Bibliometric analysis

Bibliometrics in this article refers to the use of computable statistical methods, to:

  • a) analyse scientific research records and identify research relationships in journal citations;
  • b) quantitatively assess the most dominant keywords;
  • c) identify the interrelationship between the problems investigated by different organisations and countries;
  • d) compare the coverage of research topics by countries;
  • e) create lists of keywords in groups of synonyms and related concepts (e.g. Covid-19 thesauri); and
  • f) measure term (keyword) frequencies.
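Items (e) and (f) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure used in the study: the synonym map `SYNONYMS`, the helper names, and the toy records are all hypothetical, and a real thesaurus would be far larger.

```python
from collections import Counter

# Hypothetical synonym map (an assumption for illustration, not the authors' thesaurus).
SYNONYMS = {
    "sars-cov-2": "covid-19",
    "2019-ncov": "covid-19",
    "coronavirus disease 2019": "covid-19",
}


def normalise(keyword: str) -> str:
    """Lower-case a keyword and map known synonyms to one canonical term."""
    key = keyword.strip().lower()
    return SYNONYMS.get(key, key)


def keyword_frequencies(records: list[list[str]]) -> Counter:
    """Count in how many records each canonical keyword appears."""
    counts = Counter()
    for keywords in records:
        # A set avoids counting a term twice when two synonyms occur in one record.
        counts.update({normalise(k) for k in keywords})
    return counts


# Toy records: each inner list holds the author keywords of one paper.
records = [
    ["COVID-19", "mortality", "SARS-CoV-2"],
    ["2019-nCoV", "vaccine"],
    ["Covid-19", "immunity", "vaccine"],
]
freq = keyword_frequencies(records)
print(freq.most_common(2))
```

Counting per record rather than per occurrence means a paper that lists both "COVID-19" and "SARS-CoV-2" contributes once to the canonical term, which keeps the frequencies comparable across papers.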

The bibliometric analysis in this article differs from scientometrics. Although both research fields study, measure and analyse scientific literature, scientometrics is used to measure the impact of research papers and journals, and to understand scientific citations, usually as measurements in policy and management. Our bibliometric analysis instead focuses on creating a conceptual map (of keywords, synonyms and related concepts) through factorial analysis, creating a collaboration network map (of countries or organisations), and categorising the keywords, synonyms and related concepts, then relating these groupings to countries or organisations, e.g. in a three-fields plot.
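The collaboration network map mentioned above rests on a simple count: how many papers each pair of countries co-authored. The sketch below shows that counting step only, assuming each paper has already been reduced to the set of its authors' affiliation countries; the function name `collaboration_edges` and the toy data are hypothetical, and the layout and plotting of the network are left out.

```python
from collections import Counter
from itertools import combinations


def collaboration_edges(papers: list[set[str]]) -> Counter:
    """Count co-authored papers for every unordered pair of countries."""
    edges = Counter()
    for countries in papers:
        # Sorting gives each pair a single canonical orientation.
        for pair in combinations(sorted(countries), 2):
            edges[pair] += 1
    return edges


# Toy input: the set of author-affiliation countries for each paper.
papers = [
    {"USA", "China"},
    {"USA", "China", "UK"},
    {"Australia"},
    {"UK", "USA"},
]
edges = collaboration_edges(papers)
for (a, b), weight in edges.most_common():
    print(f"{a} - {b}: {weight}")
```

Single-country papers add no edges, so the edge weights reflect international collaboration only; the weights can then be passed to any graph-drawing tool.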

4. Data mining

We used the Web of Science Analyse Results tool to get a fast understanding of the scientific literature data records on Covid-19. For this visualisation, we wanted to analyse all data records on Covid-19. Hence, our data mining included all data records on Covid-19, without using Boolean operators (e.g. AND, OR, NOT, SAME, NEAR).

4.1. Rationale for data mining for records periodically - creating snapshots in time

In the Web of Science Analyse Results data mining on the topic of Covid-19, we identified 5210 data records (on June 01, 2020). The search identified 5208 results produced in 2020, and 2 data records from 2019. We checked the 2 data records, and the publication date shown was October 2019. This was puzzling, because Covid-19 was not known until a later date, so we reviewed the two papers manually and found that the actual publication dates were in March and April 2020. This re-emphasised our argument that we should perform regular analysis of data records, because even the most established databases depend on the data supplied by journals. If that data is incorrectly structured, or at least not structured to the requirements of the database, we could get incorrect readings. In this article, we are creating a snapshot in time, and this is a more reliable representation than searching for data records from specific months at a later stage. For example, if we had conducted this data mining prior to March, the two data records from March and April would not have been included, because these data records didn’t exist until March and April, respectively. Adding to this argument, the Web of Science databases separate data records by year, not by month. In our search, we could not separate data records by month; hence, if this data mining is performed in future years, it would be even more challenging to conduct factorial analysis, three-field plots, and conceptual maps on the scientific data records for this period in time. The results would present all data records from 2020, whereas we are interested in the analysis of the data records from the first wave and the early stages of the pandemic. Since the journal editorial and peer-review process lasts from 3 to 6 months, we can assume that the records we are analysing today were submitted in the early stages of the pandemic. Hence, we can analyse the keywords, synonyms and related concepts from the first wave of the Covid-19 pandemic.

4.2. Data mining on the topic of Covid-19

In our first visualisation, we used the Web of Science Analyse Results tool. We wanted to identify the funding agencies that supported most scientific research records in the first wave and the early stages of the Covid-19 pandemic (see Fig. 1 ).

Fig. 1

Data mining on the topic of Covid-19: Tree-map of funding agencies that supported most scientific research records.

From Fig. 1 , we can see that when the scientific research data is separated by organisation, Chinese organisations emerge as leaders in this data mining visualisation. We can also see in Fig. 1 that there are multiple organisations from individual countries. In our initial community discussions, we noticed some negative comments, stating that we have ‘purposely split the two largest US institutions to halve their contribution levels’. We want to re-emphasise that one country can have multiple institutions. In Fig. 1 , we analysed the data records by organisation, not by nation. To address such comments, we analysed the data records by country in Fig. 2 .

Fig. 2

Data mining on the topic of Covid-19: Bar-graph of countries that supported most scientific research records.

From the bar-graph in Fig. 2 , we can see that the USA has produced the most scientific research, with a significant increase in data records originating from the USA since our earlier study [ 14 ]. It would have been interesting to conduct such analysis at much earlier stages of the pandemic. We can also see that the countries that were most affected (USA, China, England, and Italy) produced the most data records. We can only wonder whether the efforts of the scientific community increased as the pandemic became increasingly present in those countries. This could signify that the scientific community didn’t act, and ignored the warning signs, until they were faced with the tragedy. However, without conducting a separate data mining of records specifically from that early stage (January, February, and March), this remains an open question.

5. Data analysis

In the data mining for analysing the data records with R Studio, we used a more specific search. This resulted in far fewer data records, because in this data mining effort we used Booleans to identify records on the three topics we investigated (mortality, vaccine, and immunity).

5.1. Analysis of scientific data records on Covid-19 and mortality

From our search on TOPIC: (covid 19) AND TOPIC: (mortality), we identified 276 data records. As with the earlier visualisations, we used the Web of Science Analyse Results tool ( Fig. 3 and Fig. 4 ).

Fig. 3

Data mining on the topic of Covid-19 and mortality: Tree-map of Universities that produced most scientific research records on Covid-19 and mortality.

Fig. 4

Data mining on the topic of Covid-19 and mortality: Tree-map of countries that produced most scientific research records on Covid-19 and mortality.

From the tree-map in Fig. 3 , we can see that Chinese universities are leading the research effort on the topics of Covid-19 and mortality. It would be interesting to compare these results after a period of time, because in our recent study on this topic, we identified through historical analysis of research studies on pandemics and epidemics that Chinese universities had not been leading on these topics [ 14 ]. Nevertheless, we can clearly see from Fig. 3 that Chinese universities are now strongly dominating the research on these topics. To complement this analysis, we compare the tree-map from Fig. 3 with a new tree-map in Fig. 4 , which categorises data records by country.

In Fig. 4 , we can see that the most affected countries (USA, China, England and Italy) are leading the research efforts on Covid-19 and mortality. At this stage, we wanted to investigate these data records further, and we faced limitations in the capabilities of the Web of Science Analyse Results tool. We used the same data record, but with R Studio – in Fig. 5 .

Fig. 5

R Studio - Three-fields plot: left – keywords from the data records, middle – countries, right – authors affiliations.

We used the data records in R Studio, and we created a three-field plot in Fig. 5 , separating keywords of the research studies by country. From Fig. 5 , we can identify which university is researching specific topics related to Covid-19 and mortality. We also wanted to visualise which countries collaborate most. To visualise the collaborative relationships, we plotted this data file in a country collaboration map in Fig. 6 .

Fig. 6

R Studio – research data records on covid-19 and mortality, country collaboration map.

From the country collaboration map in Fig. 6 , we can see a strong collaboration line between the US and China and Italy, and a strong research relationship between China and the UK. Surprisingly, this analysis shows that the UK is not collaborating with the US and Italy as strongly as with China. We wanted to evaluate this result further before drawing any conclusions. To investigate, we created a social structure capturing the collaboration network between these specific countries, shown in Fig. 7 .
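
The link strengths drawn in a country collaboration map can be thought of as counts of papers shared between pairs of countries. The sketch below shows that counting step only, under the assumption that each record lists the countries of its author affiliations; the placeholder country names and the `collaboration_counts` helper are hypothetical and do not reproduce the data or the software behind Fig. 6.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: the set of affiliation countries for each paper.
papers = [
    {"Country A", "Country B"},
    {"Country A", "Country B", "Country C"},
    {"Country B", "Country D"},
    {"Country A"},  # single-country paper: contributes no link
]

def collaboration_counts(papers):
    """Count the papers shared by each unordered pair of countries."""
    pairs = Counter()
    for countries in papers:
        # Sorting makes (x, y) and (y, x) map to the same key.
        for a, b in combinations(sorted(countries), 2):
            pairs[(a, b)] += 1
    return pairs

links = collaboration_counts(papers)
for (a, b), n in sorted(links.items()):
    print(f"{a} - {b}: {n}")
```

A thicker line on the map would then correspond to a larger count for that pair.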

Fig. 7

R Studio – collaboration network on covid-19 and mortality, with specific countries in the network parameters.

By analysing the specific countries of interest, the collaboration network in Fig. 7 is more detailed, and we can see two different clusters (in green and purple). Although we identified these two clusters, we cannot analyse these collaborations in more detail with the social structure of this collaboration network.

For our final analysis of the data records on Covid-19 and mortality, we wanted to identify the related keywords and their synonyms and ask how these are related. For this, we used factorial analysis to create a conceptual structure map, with multiple correspondence analysis (MCA) – in Fig. 8 .

Fig. 8

R Studio – factorial analysis, conceptual structure map with multiple correspondence analysis (MCA).

From the factorial analysis in Fig. 8 , we can see how different concepts are correlated in research studies. It is worth mentioning that this technique scans for possible keywords in the data records and applies multiple correspondence analysis (MCA) to those identified. Occasionally, such keywords may be irrelevant. For example, it seems that the keywords Wuhan and China were found in all the data records. Hence, the statistical software extracted these keywords and used them in the factorial analysis. This is not deliberate; it is just how the statistical software extracts keywords. But it is strange that we don’t see other countries and regions in the conceptual structure map. Could this simply be because research papers include Wuhan and China in their introductions? Or could it mean that these studies are using data records shared by medical institutions in Wuhan and China? Without further analysis, this is difficult to know. But since this is not the research objective of this article, we conclude the data analysis on Covid-19 and mortality with the conceptual structure map.

5.2. Analysis of scientific data records on Covid-19 and immunity

The search for data records on Covid-19 and immunity produced only 71 results. The first thing we wanted to identify from these data records were the Covid-19 and immunity related fields shown in Fig. 9 .

Fig. 9

Tree-map of research areas identified from the Covid-19 and immunity scientific data records.

In the first wave, we have seen many discussions in the media on herd immunity, Covid-19 risk factors such as obesity, and the value of daily exercise. We wanted to identify whether there is a relationship between any of these topics in the scientific literature. The record count in each area is the total number of articles published. But from categorising the data records in Fig. 9 we cannot identify the content of the articles. Therefore, we used R Studio to analyse this data record further. We designed a co-occurrence network of the keywords from the data records. We didn’t use the articles’ keywords, because the visualisation was generic and didn’t analyse the text, as was done to produce Fig. 9 . Instead, we used the most frequently occurring keywords from the text of the data records, and we designed the co-occurrence network as a sphere in Fig. 10 .

Fig. 10

R Studio – Covid-19 and immunity, co-occurrence network of the keywords extracted from the data records.

To build the co-occurrence network as a sphere (in Fig. 10 ), we applied equivalence normalisation, with a sphere as the network layout and the Louvain clustering algorithm. This was our attempt to visualise the relationships by colour coding the keywords, synonyms and related concepts. Although we can see from Fig. 10 how these keywords, synonyms and related concepts are investigated in the context of Covid-19 and immunity, we don’t see a clear direction in this research field. To analyse the data record further, we applied factorial analysis to this data record as well, but this time we built the conceptual structure map with the multidimensional scaling (MDS) method, shown in Fig. 11 .
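
To make the normalisation step more concrete, the sketch below computes keyword co-occurrence counts and then normalises them. It assumes that "equivalence normalisation" refers to the equivalence index commonly used in co-word analysis, e_ij = c_ij² / (c_i · c_j), where c_ij is the number of records containing both keywords and c_i, c_j are their individual record counts. The toy records and the `equivalence_index` helper are hypothetical, and the Louvain clustering and sphere layout are not reproduced here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per data record.
records = [
    {"immunity", "obesity", "exercise"},
    {"immunity", "obesity"},
    {"immunity", "vaccine"},
    {"exercise", "inflammation"},
]

def equivalence_index(records):
    """Return {(kw_a, kw_b): c_ab**2 / (c_a * c_b)} for co-occurring keywords."""
    occurrences = Counter()    # c_i: records containing keyword i
    cooccurrences = Counter()  # c_ij: records containing both i and j
    for keywords in records:
        occurrences.update(keywords)
        for a, b in combinations(sorted(keywords), 2):
            cooccurrences[(a, b)] += 1
    return {
        (a, b): c_ab ** 2 / (occurrences[a] * occurrences[b])
        for (a, b), c_ab in cooccurrences.items()
    }

weights = equivalence_index(records)
for pair, w in sorted(weights.items()):
    print(pair, round(w, 3))
```

The index lies between 0 and 1, so a frequent keyword does not dominate the network simply because it appears in many records.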

Fig. 11

R Studio – factorial analysis on Covid-19 and immunity, conceptual structure map with the multidimensional scaling (MDS) method.

In Fig. 11 , a conceptual structure map, we can see a few different clusters appearing. The main cluster contains references to exercise, inflammation, smoking, obesity and many additional factors. The multidimensional scaling (MDS) method enables this visualisation of key concepts related to Covid-19 and immunity. These concepts are extracted from all of the 71 scientific research studies that we found on Covid-19 and immunity. Since the factorial analysis in Fig. 11 is designed from the keywords found in the text of the data records, and not from the keywords provided by authors, we can expect some irrelevant concepts. Again, this is simply how the statistical software works: it extracts all keywords that are found to be repeating in different data records.

5.3. Analysis of scientific data records on Covid-19 and vaccine

The final data record we analysed was on Covid-19 and vaccine. In the data mining for records, we searched the TOPIC: (covid 19) AND TOPIC: (vaccine) and we identified 189 records. As with the previous data records, firstly we analysed the data records by organisations (in Fig. 12 ) and countries (in Fig. 13 ).

Fig. 12

Bar-graph of the leading organisations in the scientific research on Covid-19 vaccine – designed with the Web of Science data mining tool.

Fig. 13

Tree-map of the leading countries in the scientific research on Covid-19 vaccine – designed with the Web of Science data mining tool.

What we can see from Fig. 12 is that the leading research organisations on Covid-19 vaccine are not based in the leading country on Covid-19 vaccine in Fig. 13 . Fudan University in Shanghai, University of Melbourne and University of Oxford are the top 3 institutions, with the same output ( Fig. 12 ), while in Fig. 13 we can see that the USA is the overall leader in research on Covid-19 vaccine. This makes it even harder to predict which country would be first to produce a Covid-19 vaccine.

From Fig. 13 , it seems that the USA, which is at present number one in total research output, is on track to produce a vaccine first, based on the most research output. But if we consider that producing a new vaccine is a lengthy process, then one could argue that the countries and organisations that started earliest would be best placed to produce a Covid-19 vaccine. This is impossible to predict, but it would be interesting to compare these results over time. This analysis and the visualisations can preserve the present understanding and efforts, and be used as a snapshot in time by future researchers analysing the Covid-19 research records. Since we cannot answer these questions from the current data records, we applied computable statistical analysis to look for further insights on the relationships between organisations, countries and Covid-19 vaccine. In Fig. 14 , we designed a three-fields plot, with countries on the left, keywords from the data records in the middle, and universities on the right.

Fig. 14

Three-fields plot: countries - left, keywords - middle, universities - right.

In this three-fields plot ( Fig. 14 ), we wanted to identify the relationships between the research findings from the leading organisations (in Fig. 12 ) and compare with the overall national research efforts (in Fig. 13 ). To design the three-fields plot (in Fig. 14 ), we extracted the keywords from all the data records, and we associated the keywords with countries, and organisations. What becomes visible from Fig. 14 , is the lower research output of the US in the keywords that are most present in all data records. Since the US is the leader in the overall research on Covid-19 vaccine at present, we wanted to determine if the US research is focused on different research areas, and not related to the keywords that are taken as most represented in the combined research records from all countries. We continued the analysis with a topic dendrogram (in Fig. 15 ).
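
The association step behind a three-fields plot can be sketched as two sets of link weights: country to keyword on the left, and keyword to organisation on the right. The example below is only an illustration of that idea under the assumption that each record carries one country, one organisation and a keyword list; the placeholder values and the `three_field_links` helper are hypothetical and are not taken from the data records or from the R software used for Fig. 14.

```python
from collections import Counter

# Hypothetical records; all country, organisation and keyword values
# are placeholders chosen for illustration.
records = [
    {"country": "Country A", "org": "Org 1", "keywords": ["vaccine", "antibody"]},
    {"country": "Country A", "org": "Org 2", "keywords": ["vaccine"]},
    {"country": "Country B", "org": "Org 3", "keywords": ["antibody", "epitope"]},
]

def three_field_links(records):
    """Return link weights for country-keyword and keyword-organisation pairs."""
    left = Counter()   # (country, keyword) -> number of records
    right = Counter()  # (keyword, organisation) -> number of records
    for rec in records:
        for kw in set(rec["keywords"]):
            left[(rec["country"], kw)] += 1
            right[(kw, rec["org"])] += 1
    return left, right

left, right = three_field_links(records)
print(left.most_common(3))
print(right.most_common(3))
```

A country with high overall output but small weights on the most common keywords would show up here as many thin links, which is the kind of pattern the plot is meant to expose.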

Fig. 15

Topic dendrogram on Covid-19 and vaccine keywords.

To design the topic dendrogram in Fig. 15 , we used factorial analysis with multidimensional scaling as the method for parameter analysis of the different keywords. Although we can see how clusters develop from the correlations of the keywords, it is unclear which countries and organisations collaborate to create these clusters. The final data analysis we performed on this data record, was a design of social structure as a collaboration network of the current research efforts on Covid-19 and vaccine development (in Fig. 16 ).

Fig. 16

Covid-19 vaccine social structure of the research collaboration network.

From the collaboration network in Fig. 16 , we can see that despite the current political issues between the USA and China, the scientific research collaboration on Covid-19 vaccine is very strong. This brings some optimism in the search for a Covid-19 vaccine. This also supports the previous findings in Fig. 7 on the USA and China collaboration network on Covid-19 and mortality research. It seems that despite the apparent lack of collaboration between the leading countries on Covid-19 research (see Fig. 6 ), the scientific research is ongoing, but possibly the visibility of these research collaborations is limited.

6. Discussion

The findings of this study confirm that, despite political disagreements, the collaborations on Covid-19 scientific research between the countries leading this research are strong. The study also finds some correlation between the countries worst affected by Covid-19 and the countries most productive in Covid-19 research. We can see the same countries showing up in all of the visualisations as leaders. These same countries are also the worst affected by Covid-19. With some exceptions (e.g. Germany), the majority of the countries that were affected worst produced the most output. There is a second correlation in the data records: countries tend to produce more output as they become affected by Covid-19. For example, organisations in China were leading the research efforts in the early stages of the pandemic. But as Covid-19 spread to other countries, and reached the UK, USA, Italy and India, the output of these countries increased. We can expect to see these results changing quickly, and if we repeat the same analysis after some time, we can expect different organisations to be in the lead. The implications of this study for future research are the availability of a repeatable approach for the analysis and visualisation of data records, and an initial baseline measurement. Future research can use these visualisations and replicate the analysis with additional data as it becomes available. This study presents a snapshot in time, which will preserve the state of Covid-19 research as of June 01, 2020. Future research dimensions should include, firstly, a comparison of the data records to seek changes in topography. Secondly, future research can use the visualisations and the data records from this study to analyse the historical response from the first wave of Covid-19. The implication for organisations and practitioners is a clear comparison of output, so that different organisations can visualise and compare their performance in the first wave of Covid-19.
Evaluation of past performance could be used in improving the response in the second wave, or in future pandemics. For example, organisations that reacted more slowly can develop new and improved response strategies. The study identified the best performers in terms of volume of research output, and practitioners could learn from these organisations. We really need to learn to start picking up the phone and communicating with other organisations, and learning about the pandemic. In terms of using this research for improving medical systems, we could refer to the factorial analysis, where we extracted and correlated keywords from all data records to create conceptual maps of how keywords, synonyms and concepts are related. While we hear about many different risk factors (e.g. obesity, smoking, etc.), we usually see these risk factors in isolation. From the factorial analysis in this study, we can visualise all the risk factors in clusters. Finally, we include a separate section on confirming validity and eliminating bias.

6.1. Eliminating bias and confirming validity

We considered the limitations of qualitative literature review, especially the possible limitations in value caused by bias in collecting data records manually. Data records can be selected to represent a biased viewpoint (e.g. one nation, organisation, or author trying to show a better performance than others). We instead used a wholly statistical data mining approach. The records we collected from the Web of Science Core Collection include all data records (as found on June 01, 2020) on the three topics we analysed (Covid-19 mortality, vaccine, and immunity). This eliminated bias in selecting the data records. Since we used well-known and established computable statistical programs, the risk of incorrect representation of sources, and of bias in the analysis, was eliminated by the statistical software. In the spirit of reproducible research, we include our data records in the submission of this article. In brief, to eliminate bias, we used a data mining approach. To confirm the validity of the results, we performed a diverse set of computable statistical analyses.

7. Conclusion

In this research study, we conducted data mining with the Web of Science Analyse Results tool and identified 5210 records on Covid-19. We created data visualisations to identify the countries, institutions and organisations leading the scientific research on Covid-19 in terms of volume of research output. In our first visualisation of the data records, Chinese organisations emerged as leaders. By the time we finished the paper, we conducted a second data mining and visualisation. From the second visualisation, we noticed a significant increase in data records originating from the USA. This could represent a higher research output when the effect of the pandemic is higher, but it is hard to determine this with certainty without additional analysis of future data records. Hence, this study presents a snapshot in time, and can be used by future studies to investigate the relationship between research output and spread of the pandemic.

We focused our analysis of scientific research records on identifying research relationships in journal citations, and we confirmed that despite political and trade disputes, there is a strong collaboration on Covid-19 research between the most affected countries. We used R Studio to quantitatively assess the most dominant keywords, and we present visualisations that integrate the keywords from all scientific research studies. With the overwhelming number of studies emerging every day, we identified a process for visualising the keywords from all studies. We used statistical software to measure keyword frequencies, and we presented conceptual maps based on the statistical analysis. In the data visualisations, we also presented the keywords in groups of related concepts, and we identified the interrelationship between the problems investigated by different organisations and countries. We presented the keywords in a dendrogram showing different clusters of Covid-19 research, and we designed three-fields plots connecting different Covid-19 topics to countries and organisations. Apart from identifying existing collaborations, the data visualisations enable quick identification of related research in different organisations and institutions. This should promote increased collaboration between organisations and institutions conducting research on similar or related topics, and speed up the research process and output. We also compared the coverage of research topics by countries, and we presented visualisations in the form of collaboration maps, structured around the three topics we investigated. We used tree-maps and bar-graphs to analyse the data records in three separate sections, corresponding to Covid-19 mortality, immunity, and vaccine. Then we designed three-fields plots and country collaboration maps to analyse the three topics further, and we used multiple correspondence analysis (MCA) to create conceptual structure maps of the collaboration networks.
The data visualisations presented in this research can be used as a snapshot in time by future research studies on this topic. At present, the visualisations can be used to review the state of Covid-19 research, and to navigate through the increasing volume of scientific publications on the state of the pandemic.

7.1. Limitations

This research study is based on the limited data records available on Covid-19 and the topics investigated. Future research should use the findings as a record of the scientific data records in this time period. Research data records are changing every day, and this study wanted to present a snapshot in time, to assist future analysis of the Covid-19 response in the first wave of the pandemic.

The authors acknowledge The Engineering and Physical Sciences Research Council (EPSRC) UK funding [grant number: EP/S035362/1] and the Cisco Systems, USA [grant number DFR05640].

Availability of data and materials

Authors’ contributions

Dr Petar Radanliev: main author; Prof. Dave De Roure: supervision; Rob Walton: review and corrections.

Declaration of competing interest

On behalf of all authors, the corresponding author states that there is no conflict nor competing interest.


Eternal gratitude to the Fulbright Visiting Scholar Project.

1 http://apps.webofknowledge.com/WOS_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch&SID=C3DVq2qEsnSXLxyxR1u&preferencesSaved= .

  • Open access
  • Published: 20 September 2021

Association mining based approach to analyze COVID-19 response and case growth in the United States

  • Satya Katragadda 1 ,
  • Raju Gottumukkala 1 ,
  • Ravi Teja Bhupatiraju 1 ,
  • Azmyin Md. Kamal 1 ,
  • Vijay Raghavan 1 ,
  • Henry Chu 1 ,
  • Ramesh Kolluru 1 &
  • Ziad Ashkar 1  

Scientific Reports volume  11 , Article number:  18635 ( 2021 ) Cite this article


  • Data mining
  • Epidemiology
  • Public health

Containing the COVID-19 pandemic while balancing the economy has proven to be quite a challenge for the world. We still have limited understanding of which combination of policies has been most effective in flattening the curve, given the dynamic and evolving nature of the pandemic, the lack of quality data, and other challenges. This paper introduces a novel data mining-based approach to understand the effects of different non-pharmaceutical interventions in containing the COVID-19 infection rate. We used the association rule mining approach to perform descriptive data mining on publicly available data for 50 states in the United States to understand the similarities and differences among various policies and underlying conditions that led to transitions between different infection growth curve phases. We used a multi-peak logistic growth model to label the different phases of the infection growth curve. The common trends in the data were analyzed with respect to lockdowns, face mask mandates, mobility, and infection growth. We observed that face mask mandates combined with mobility reduction through moderate stay-at-home orders were most effective in reducing the number of COVID-19 cases across various states.


The COVID-19 outbreak has brought the world to a standstill. Until the COVID-19 vaccine became available recently, several non-pharmaceutical interventions were used to contain the outbreak, which included stay-at-home orders, social distancing at 6 feet, limited gatherings, hand washing, refraining from touching the face, and masking. Among these, stay-at-home orders potentially carried a high economic cost in lost revenue and financial support to the unemployed 1 . Given the complex dynamics of COVID-19, the high variability of intervention strategies, and the complexity of pandemic behavior, understanding the differential impact of combinations of various measures is non-trivial.

Several methods have been used to study the impact of various non-pharmaceutical interventions and policies on COVID-19 infection growth rate. The initial studies focused on agent-based simulations and statistical correlation analysis. Li et al. used a compartmental model to evaluate the effect of social distancing and cloth face coverings on the spread of infections 2 . Tatapudi et al. presented a study on Miami-Dade County to understand how social-mixing behavior, stay-at-home orders, and contact tracing affect both the case growth and economy 3 . Similar efforts include agent-based simulation developed by Silvia et al. 4 , and Ghaffarzadegan 5 . As more case growth data became available, researchers used data-driven methods to find associations between non-pharmaceutical interventions and case growth data. Correlation and regression-based analyses were performed by Badr et al. 6 and Sarmadi et al. 7 to understand the impact of mobility on infection spread. Bendavid et al. used regression-based analysis to study the impact of business closures and stay-at-home orders on epidemic case growth across 10 countries, including the United States 8 . Regression-based methods were used to understand the association between the timing of mandated lockdown orders 9 , social, economic, and demographic determinants 10 , and the spread of COVID-19 across various counties in the United States. The dynamics of incidence and mortality rates were also found to vary across regions in the United States 11 . James and Menzies studied the second surge in COVID-19 cases to understand the evolutionary patterns using time-series analysis and hierarchical clustering. The second surge revealed common characteristics of states that most and least successfully managed COVID-19 12 .

Several studies analyzed the impact of face mask usage on the number of COVID-19 cases. A recent study found that a universal mask mandate would help alleviate the worst effects of epidemic resurgence in many states across the United States 13 . Fischer et al. applied logistic regression-based models on mask-wearing and social distancing guidelines and found that states with mask adherence \(\ge \) 75% had 140 fewer cases per capita than states with less than 75% mask adherence 14 . Dasgupta et al. used Poisson regression models to examine associations between the implementation of community mitigation policies and identification of a county as a rapid riser and found that counties in states that closed for fewer days (0 to 59) and had no mask mandate at reopening had a higher probability of becoming a rapid riser county 15 . Another study on 198,077 participants across the United States used hazard ratio to find associations between community-level social distancing measures and individual face mask use with reduced risk of COVID-19 surge 16 . Krishnamachari et al. examined the impact of school closures, stay-at-home orders, and mask mandates based on the length of the mandate on cumulative incidence rates of COVID-19 in all states in the US using negative binomial regression 17 . Lyu and Wehby compared the case growth rate between states with and without mask mandates during the pandemic using a regression-based approach 18 . Guy et al. used weighted least-squares regression to measure the impact of various policies like mask mandates and on-premises dining across 38 states in the US with the change in the case and death rates before and after the implementation of the policies 19 . Most of these studies look at the adherence to masks or social distancing guidelines across various counties and states and its impact on the number of cases.
Rather than analyzing the impact of one or two non-pharmaceutical interventions, it is important to analyze the association between the combination of multiple interventions and local infection dynamics. To accomplish this, this paper introduces an association mining approach to analyze similarities across various policies and infection rates in communities for various phases of the pandemic.

Association rule mining (ARM) is a common data mining technique used to discover similarities and dissimilarities among objects 20 . The approach was originally designed to obtain insights into consumer buying habits, such as understanding the groups of products customers would buy together 20 . The approach later garnered interest in many domains 21 , 22 , 23 , 24 . Recently in public health, ARM was used to analyze the relationship between environmental stressors and adverse human health impacts 25 .

We used an ARM approach to analyze how various non-pharmaceutical interventions contributed to infection growth. Rather than offer clear hypothesis-based objectives, the proposed technique provides insights into similarities and dissimilarities among various combinations of policies and local conditions that led to an increase or decrease in infection rates. We use publicly available data collected from all 50 states to discover common patterns linking several factors, namely stay-at-home orders, face mask orders, population density, mobility, and current infection rates, to future infection rates across various states in the United States.

Data and methods

Association mining allows us to perform a descriptive analysis of patterns between various factors known to influence infection growth rate and the actual infection growth rate. We specifically looked at population density, infection rate, face mask orders, stay-at-home orders, and mobility 6 , 26 , 27 , 28 , 29 .

Association rule mining

Given a dataset containing a collection of records or transactions, each record comprises a set of categorical attributes. One of the attributes is the target attribute of interest. The association rule may be denoted by \(A \Rightarrow B\) , where A (the antecedent or LHS) and B (the consequent or RHS) are sets of various attribute-value pairs (also called itemsets), and are disjoint. The rule represents the hypothesis that when variables in A occur in the dataset, the variables in B also occur. Association mining generates a large number of rules from a given dataset. In a dataset with m attributes ( \(n-1\) antecedents and one consequent), each with n values, each can generate a maximum of \(nm^{(n-1)}-1\) rules. However, not all rules are significant. The goal of this approach is to find rules that have high practical significance. To eliminate spurious rules, we use three measures: support, confidence, and lift. In addition, we also use the chi-squared test to measure the statistical significance of the association between the antecedent and the consequent.

Given two disjoint sets of attribute-value pairs A and B , and an association rule \(A \Rightarrow B\) , the support of the rule refers to the number of records in which the attribute-value pairs of both sets A and B appear, relative to the total number of records (transactions or instances). This denotes the prevalence of the rule in the dataset. By definition, the support value is symmetric (i.e., the supports of the rules \(A \Rightarrow B\) and \(B \Rightarrow A\) are equal). Similarly, support(A) is the ratio of the number of records containing the itemset A to the total number of records in the dataset. The confidence of the rule \(A \Rightarrow B\) measures the conditional probability of B , given A . Thus, the confidence measure for a given rule is asymmetric.

Lift is the ratio between the observed support of the rule and the support that would be expected if A and B were independent. A \(lift > 1\) indicates positive dependence, a \(lift < 1\) indicates negative dependence, and \(lift = 1\) shows that A and B are independent. Lift is also a symmetric measure between the itemsets A and B .
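The three measures can also be written compactly. The block below is a sketch using the standard textbook definitions; \(\sigma (\cdot )\) (the number of records containing every attribute-value pair of an itemset) is notation assumed here for illustration, and n denotes the total number of records.

```latex
\begin{align*}
\mathrm{support}(A \Rightarrow B)    &= \frac{\sigma(A \cup B)}{n}, \qquad
\mathrm{support}(A) = \frac{\sigma(A)}{n}, \\[4pt]
\mathrm{confidence}(A \Rightarrow B) &= \frac{\mathrm{support}(A \Rightarrow B)}{\mathrm{support}(A)}, \\[4pt]
\mathrm{lift}(A \Rightarrow B)       &= \frac{\mathrm{support}(A \Rightarrow B)}{\mathrm{support}(A)\,\mathrm{support}(B)}.
\end{align*}
```

Written this way, the symmetry of support and lift and the asymmetry of confidence follow directly: swapping A and B leaves the first and third expressions unchanged but changes the denominator of the second.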

In addition to lift, the chi-squared test has also been used to measure the statistical significance of the dependence between the antecedent and the consequent in association rules 30 , 31 . However, it should be noted that the chi-squared test, being a symmetric measure, does not capture the direction of the dependence between the antecedent and the consequent of a rule, which is provided by the confidence measure from Eq. ( 3 ). The chi-squared value of an association rule \(A \Rightarrow B\) is defined by Alvarez 31 in terms of the support, confidence, and lift measures and is provided below:

\[\chi ^2 = n\,(lift-1)^2\,\frac{support \cdot confidence}{(confidence-support)\,(lift-confidence)}\]

where n is the total number of transactions in the dataset. The association between the antecedent and the consequent is considered significant if the chi-squared value is greater than a threshold determined by the chi-squared distribution. For an association rule, the number of degrees of freedom is one.
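As a minimal sketch of how these quantities fit together, the Python snippet below computes support, confidence, and lift from raw counts, evaluates the chi-squared value from those three measures, and cross-checks it against the Pearson statistic of the corresponding 2×2 contingency table. The function names and the example counts are hypothetical and chosen only for illustration; the closed form used is the algebraic rewriting of the 2×2 Pearson statistic and is assumed to match the expression attributed to Alvarez.

```python
import math


def rule_measures(n, n_a, n_b, n_ab):
    """Support, confidence, and lift of A => B from raw record counts.

    n: total records; n_a / n_b: records containing A / B;
    n_ab: records containing both A and B.
    """
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    return support, confidence, lift


def chi2_from_measures(n, support, confidence, lift):
    """Chi-squared value written in terms of support, confidence, and lift.

    Undefined (division by zero) when A or B occurs in every record.
    """
    return (n * (lift - 1) ** 2 * support * confidence
            / ((confidence - support) * (lift - confidence)))


def chi2_from_table(n, n_a, n_b, n_ab):
    """Pearson chi-squared statistic of the 2x2 table (A / not A) x (B / not B)."""
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    expected = [n_a * n_b / n, n_a * (n - n_b) / n,
                (n - n_a) * n_b / n, (n - n_a) * (n - n_b) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_one_dof(chi2):
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    n, n_a, n_b, n_ab = 1050, 120, 400, 100
    s, c, l = rule_measures(n, n_a, n_b, n_ab)
    print(f"support={s:.3f} confidence={c:.3f} lift={l:.3f}")
    print(f"chi2 (measures)={chi2_from_measures(n, s, c, l):.2f}")
    print(f"chi2 (table)   ={chi2_from_table(n, n_a, n_b, n_ab):.2f}")
```

With one degree of freedom, a chi-squared value of about 3.841 corresponds to a p-value of about 0.05, which `p_value_one_dof` reproduces.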

In this paper, we model face-covering orders, social distancing orders, mobility, population density, case level, and the current incident phase as the contributing factors (i.e., the antecedent). The target variable (the consequent) is the future incident growth phase. One of the critical assumptions for ARM is that all the values of attributes are discrete. We discretized the numerical data used in the study (i.e., mobility, number of cases per capita) into five quantiles. We also discretized the continuous data of infection growth curve into five phases based on the logistic growth model.
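The quantile-based discretization of numerical attributes can be sketched as follows. This is an illustrative example rather than the study's implementation: the helper names are hypothetical, the label set mirrors the five classes (very low to very high) used for the discretized variables, and the choice of the "inclusive" quantile method is an assumption.

```python
import bisect
import statistics

# Five ordered classes, one per quintile.
LABELS = ["very low", "low", "medium", "high", "very high"]


def quintile_cut_points(values):
    """Return the four cut points that split the values into five quantile bins."""
    return statistics.quantiles(values, n=5, method="inclusive")


def discretize(value, cut_points):
    """Map a numeric value to its class label using the precomputed cut points."""
    return LABELS[bisect.bisect_right(cut_points, value)]


if __name__ == "__main__":
    # Hypothetical mobility values (percent of the pre-pandemic baseline).
    mobility = [42, 55, 61, 70, 78, 83, 90, 96, 104, 120]
    cuts = quintile_cut_points(mobility)
    for value in mobility:
        print(value, discretize(value, cuts))
```

Computing the cut points once over the full study period and reusing them for every record keeps the class boundaries identical across states and weeks.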

Data collection and preprocessing

Our study includes weekly aggregated data from all the 50 states within the United States between June 1st and November 15th, 2020. We start our data collection on June 1st because including earlier data may skew our analysis (only eight states had a mask mandate before June and most of the states were under lockdown 32 ). We end our study period on November 15th before the start of the winter holiday season. Discretized attributes, values, and the frequency distribution of each attribute-value pair are presented in Table 1 .

We used the official face-covering orders issued by various governors or local authorities from AARP State-by-State Guide to Face Mask Requirements 33 and Masks4All compilation 34 . We rounded the dates to the start of the workweek. The four categories of mask orders are No-Mask, county-wide, recommended (state-wide), and mandated (state-wide). The discretized dataset we produced and detailed definitions of each of these orders were provided on GitHub 35 . We illustrated the state mask mandate variation across all the states in Fig. 1 .

figure 1

Timeline of various mask mandates issued across all the states in the United States.

State reopening

All states initiated a strict lockdown at the beginning of the pandemic in March 2020. The states modified these orders based on the perceived risk of cases, hospitalizations, and deaths while also trying to bring back the economy. States mostly adapted the guidelines provided by the White House COVID-19 task force reopening procedures 36 , 37 . The specific orders that were considered include Phase-0, Phase-1, Phase-2, Phase-3, Phase-4, and Phase-5. Detailed definitions of each of these orders were provided at this webpage 35 .

Mobility levels

The mobility information was obtained from Descartes Labs, a popular dataset used by several studies for analyzing the relationship between mobility and COVID-19 case growth 4 , 38 , 39 . The dataset uses anonymized mobile device locations to calculate a local mobility metric. The metric represents the median of the max-distance traveled by individuals at the state and county level, normalized to the value of the metric before the pandemic 40 .

Population density

The population density of each state represents the number of people per square mile of land area based on the 2020 population estimates 41 .

Cases per capita

We extracted the official COVID-19 weekly case data from June 1st to November 10th for the United States from the Johns Hopkins University Dashboard 42 . We calculated the per capita cases based on the estimated 2019 US Census population data.

Incidence phases

We discretized the incidence growth rate of the pandemic into five phases based on the standard intervals obtained from a logistic growth curve 43 , 44 . Given that the states have multiple peaks, we use a multi-peak-based logistic growth model from Batista et al. 43 to obtain discrete phases: (a) Phase-I is the early-growth (or ascending) phase, (b) Phase-II is the fast-growth phase, which falls between the end of the lag phase (or slow-growth phase) and the peak, (c) Phase-III is the decline phase, where the cases decrease from fast-growth to steady-state, (d) Phase-IV is the steady-state phase, and finally (e) Phase-V is the ending phase. We illustrated the first 4 phases for the state of Arizona in Fig. 2 ; the fifth phase is not visible in the image.

figure 2

Logistic growth model applied to the state of Arizona.

The incidence growth can be envisioned as transitions between various growth phases. Once the incidence curve enters the fast-growth phase, public health officials intervene to flatten the curve using warnings and outreach urging people to stay home or promoting face coverings. The study considers both the current and future incidence phases for association rule mining. The current phase is part of the antecedent, and the future phase is the consequent/target variable with a lag of 4 weeks. Based on a preliminary analysis, we found that mobility, reopening mandates, and other factors are correlated with the number of cases with a lag of 4 weeks.
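The pairing of each week's attributes with the incidence phase four weeks later can be sketched as below. The function name, the record layout (one dictionary per week with a `phase` key), and the `attribute=value` item encoding are assumptions made for illustration only.

```python
def build_transactions(weekly_records, lag_weeks=4):
    """Turn one state's ordered weekly records into lagged transactions.

    Each transaction holds the current week's attribute-value pairs, the
    current phase, and the phase observed `lag_weeks` later as the target.
    Weeks without a future observation are dropped.
    """
    transactions = []
    for i in range(len(weekly_records) - lag_weeks):
        current = weekly_records[i]
        future = weekly_records[i + lag_weeks]
        items = {f"{key}={value}" for key, value in current.items() if key != "phase"}
        items.add(f"current_phase={current['phase']}")
        items.add(f"future_phase={future['phase']}")
        transactions.append(frozenset(items))
    return transactions


if __name__ == "__main__":
    # Hypothetical 25-week history for a single state.
    history = [
        {"mask": "state-wide", "mobility": "high",
         "phase": "early-growth" if week < 10 else "fast-growth"}
        for week in range(25)
    ]
    lagged = build_transactions(history)
    print(len(lagged))        # number of usable weeks after applying the lag
    print(sorted(lagged[6]))  # a week whose future phase differs from its current phase
```

Because the last four weeks have no future phase to pair with, a 25-week history yields 21 transactions per state.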

We collected 25 weeks of data, from June 1, 2020, to November 15, 2020, across all 50 states. Since the future incidence phase is lagged by four weeks, we ended up with 21 weeks of transactional data. The dataset thus has 1050 transactions, with each transaction corresponding to one week of data for one state (21 weeks for each of the 50 states). An example rule would be \(Mask\,Usage: state\text{-}wide\, \& \,Current\,Phase: early\text{-}growth \Rightarrow Future\,Phase: early\text{-}growth\) . This rule implies that when a state-wide mask mandate is active and the state is in the early-growth phase, the state would remain in the early-growth phase four weeks later. Mask usage, current phase, and future phase are the attributes. State-wide and early-growth are the corresponding values for mask mandate and current incidence phase, respectively. The antecedents in the dataset are mask mandates, state re-openings, mobility levels, case levels, population density, and current incidence phase. The consequent or the target variable is the future incidence phase. In this analysis, we set the minimum support threshold to 0.01. This means that the combination of factors in the antecedent and the consequent should appear in at least ten transactions (ten weeks of data) to be considered important. This threshold could mean that the rule can appear across 10 weeks in a single state or 1 week across 10 states or any combination in between. The minimum confidence is 0.7, and the minimum lift is 1.
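To make the thresholds concrete, the following sketch enumerates candidate rules by brute force and keeps those that satisfy the minimum support, confidence, and lift. It is not the study's code: the function name, the `attribute=value` item encoding, the `future_phase=` prefix for the target, and the cap on antecedent size are all assumptions made for illustration, and a dedicated ARM library would normally replace the exhaustive enumeration.

```python
from collections import Counter
from itertools import combinations


def mine_rules(transactions, target_prefix="future_phase=", min_support=0.01,
               min_confidence=0.7, min_lift=1.0, max_antecedent_size=2):
    """Enumerate rules 'antecedent => target item' and filter by the thresholds."""
    n = len(transactions)
    antecedent_counts = Counter()
    consequent_counts = Counter()
    joint_counts = Counter()

    for transaction in transactions:
        targets = [item for item in transaction if item.startswith(target_prefix)]
        others = sorted(item for item in transaction if not item.startswith(target_prefix))
        consequent_counts.update(targets)
        for size in range(1, max_antecedent_size + 1):
            for antecedent in combinations(others, size):
                antecedent_counts[antecedent] += 1
                for target in targets:
                    joint_counts[(antecedent, target)] += 1

    rules = []
    for (antecedent, target), n_ab in joint_counts.items():
        support = n_ab / n
        confidence = n_ab / antecedent_counts[antecedent]
        lift = confidence / (consequent_counts[target] / n)
        # Keep rules meeting support and confidence, with lift strictly above the minimum.
        if support >= min_support and confidence >= min_confidence and lift > min_lift:
            rules.append({"antecedent": antecedent, "consequent": target,
                          "support": support, "confidence": confidence, "lift": lift})
    return sorted(rules, key=lambda r: (r["confidence"], r["support"]), reverse=True)


if __name__ == "__main__":
    # Hypothetical toy transactions, not the study's data.
    toy = (
        [frozenset({"mask=none", "current_phase=early-growth",
                    "future_phase=fast-growth"})] * 4
        + [frozenset({"mask=state-wide", "current_phase=early-growth",
                      "future_phase=early-growth"})] * 4
        + [frozenset({"mask=state-wide", "current_phase=decline",
                      "future_phase=decline"})] * 2
    )
    for rule in mine_rules(toy):
        print(rule)
```

In this toy data, the rule with only `mask=state-wide` in the antecedent is dropped because it points to more than one future phase and its confidence falls below 0.7, while adding the current phase to the antecedent produces rules that pass.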

Of the 55,125 relationships generated from the original transactions, 429 met the minimum threshold levels described in the Data and Methods section (support of 0.01, confidence of 0.7, and a lift value greater than 1). Each of these rules appeared in at least 10 transactions, i.e., 10 weeks of observations across the United States. With a confidence score of at least 0.7, the consequent (RHS) of each rule appears in at least 70% of the transactions that contain the antecedent (LHS). Finally, a lift score greater than 1 tells us that the antecedent and the consequent are positively correlated, which supports deriving conclusions from the data.

Table 2 shows the top 5 association rules for various combinations of current and future incidence phases. These rules show various factors that contributed to the infection growth pattern, which is represented as one of four phases (i.e., early-growth, fast-growth, decline, and steady-state). Of the 8 possible combinations between the current and the future incidence phases, we observe strong association rules that satisfy the minimum thresholds described above for 5 combinations: continued early-growth, early-growth to fast-growth, continued fast-growth, continued decline, and steady-state to early-growth. In Table 2 , the first five rules highlight the circumstances where the incidence of cases stays constant, continuing in the same phase. The next five rules highlight scenarios where the incidence rate increases in the early-growth phase and transitions into the fast-growth phase. We also present the support, confidence, and lift values for each of these rules. These represent the rule’s coverage, strength, and predictive power, respectively, along with the chi-squared value of that rule. Given an antecedent and a consequent of a rule, the critical value of \(\chi ^2\) is 3.841 for a significance of p<0.05 45 . A chi-squared value greater than 3.841 implies that the association between the antecedent and consequent in a rule is significant. All the association rules presented in Table 2 are significant.

We observed five combinations of current and future phases in the extracted association rules. The following is a summary of interesting observations:

Continued Early-Growth These rules represent the scenarios in which the number of cases continues to grow at a constant rate. The most important rule (i.e., 11% support and 97% confidence) shows that a state can remain in an early-growth phase even when there is a mask mandate. Another rule with lower support (5% support and 76% confidence) represents a scenario where states without a mask mandate and with high mobility remain in the early-growth phase. In addition, the rules in the continued early-growth phase also demonstrate that states with a mask mandate, along with high mobility, medium-case levels, and phase-3 social distancing, will also continue in the early-growth phase.

Early-Growth to Fast-Growth Here, the number of cases increases rapidly, leading to an explosion in the number of new cases. The top 5 rules that contributed to the fast-growth phase from the early-growth phase have no mask mandates as the underlying common factor. Moreover, these rules have strong support and high confidence when no-mask is combined with low mobility, strict social distancing guidelines (i.e., phase 0), and a low number of cases.

Continued Fast-Growth When a state is in a fast-growth phase, we did not observe a specific combination of factors that lead to a decrease in the number of cases.

Continued Decline When case counts were decreasing, the top 5 rules have either a county-level or a state-level mask mandate. We observed this pattern alongside multiple factors (high mobility, high case levels, and relaxed social distancing guidelines).

Steady-State to Early-Growth When the states transitioned from a steady-state to the early-growth stage (indicating a resurgence in COVID 19 cases), we observed all the top 5 rules had a no-mask mandate. Other antecedents for these rules include a combination of a lower number of cases, strict social distancing guidelines, and very high mobility.

We used a Sankey diagram to illustrate the combination of factors that contribute to different infection growth phases in Fig. 3 . We present the contributing factors on the left and the resulting phase from the combination of contributing factors on the right. The width of the edge between an antecedent and a consequent represents the number of rules for that antecedent and consequent pair. The flow lines show the relative strength of different factors (mask mandates, local mobility, population density, and social distancing orders) that contribute to the future incidence phase. The higher the number of rules for a particular variable, the larger the impact of that variable on the incidence outcome. For example, in the case of the state-wide face mask mandate, the highest number of rules (77 rules) is associated with the early-growth phase, followed by the fast-growth phase (66 rules), and the decline phase has the fewest rules (12 rules) in the dataset. The following are some interesting observations from Table 2 .

Rules with no mask mandate were only associated with either an early-growth phase (54.34%) or a fast-growth phase (45.65%). There were no rules with a no-mask mandate where the future incidence phase is a decline phase or a steady-state phase.

In comparison, the rules with mask mandates (state-wide and countywide) were associated with all three future incidence phases: early-growth, fast-growth, and decline phases with 52.12%, 35.1%, and 12.76% rules in each phase, respectively.

Reopening guidelines issued by the states were strongly associated with specific phases of the pandemic. Strict guidelines instituted during Phase 0 were always associated with rules in the early-growth and the fast-growth phases, as most states imposed strict lock-downs as the number of cases started to increase. On the other hand, the incidence of cases increased when these restrictions were relaxed. Phase 3 and 4 reopening guidelines led to a resurgence in the incidence (early-growth and fast-growth) in 87.74% of the rules, and a decrease in incidence was observed in 12.24% of the rules.

Mobility has a considerable impact in determining the future phase of the pandemic. Among the rules with low or very low mobility, 3.2% had early-growth as the future phase, 80.6% led to a fast-growth phase, and 16.12% led to a decline phase. On the other hand, the rules with medium or higher mobility were associated with early-growth, fast-growth, and decline as the future phase in 65.7%, 30.09%, and 3.3% of cases, respectively. These distributions imply that lower mobility was more often associated with a decline in the number of cases, while higher mobility was associated with an increase in the number of cases.

figure 3

Association of various variables to the consequent (future incidence curve of the pandemic).

COVID-19 policies with respect to vaccinations, mobility restrictions, shutdowns, mask mandates, etc., are currently the nation’s highest priorities towards saving lives and protecting the economy. Identifying and profiling the combination of policies that worked and did not work is important. This provides the necessary data for a rational decision support framework on how best to manage policies at the state level, given their diverse attributes. While the existing studies provide individual correlations, associations, forecasting, etc., they do not provide insights into effective combinations. The goal of our proposed method is to improve this understanding to aid policymakers in making the right decisions to help minimize spread while balancing convenience and economic growth priorities.

Relationship between case-growth and mask mandates

Based on the association rules in Table 2 , the absence of a mask mandate was always associated with an increase in the number of cases, and mask mandates were associated with a decrease in the number of cases. While it is not clear which specific measures led to a decrease in the number of cases, the mask mandates were always associated with a continued decline in the number of new cases. Most of the states issued a mask mandate when the number of cases was increasing rapidly, alongside stay-at-home orders. This observation is in line with earlier research showing that strong social distancing measures reduced the number of cases. However, the effect of mask mandates separate from social distancing measures is not apparent in the fast-growth phase. This was because the two measures were typically instituted together when the cases were increasing. For this reason, we cannot assess the differential contributions of these measures. We observed that the mask mandates were effective in the early-growth and decline phases of the pandemic. We also observed that the states that did not institute a mask mandate continued to see an increase in the number of cases for a longer duration than the states that did. Figure 4 shows the relationship between the number of cases per capita and the length of time the mask mandates were active in the different states. The color of the map shows the population density of a state, and the size shows the number of cases in that state. We observe that the longer the mask mandates were active, the lower the number of cases per capita. We also observed that states with high population densities that instituted a mask mandate had a lower number of cases per capita.

figure 4

Relationship between the number of cases per capita and the number of weeks a mask mandate is active in a state.

Relationship between mobility and case-growth

Our results shown in both Table 2 and Fig. 3 indicate that mobility also impacts the incidence rate of the pandemic. The association rules indicate that increased mobility and a lack of mask mandates were associated with a resurgence of cases. A majority of the states in the United States successfully controlled the spread of the pandemic in spring and summer with strict social distancing guidelines and the resultant reduction in mobility. However, all the states had an increase in the number of cases in October and November, despite having issued mask mandates at state and county levels. This was likely related to increased mobility during this time period. In states that did not institute mask mandates, there was an increase in the number of cases irrespective of the mobility levels or the social distancing guidelines issued by the state and local authorities. By this, we surmise that social distancing and masking regulations were by themselves inadequate to reduce the number of new cases.

Figure 5 shows the relationship between the number of cases per capita and the median of maximum mobility for that state at a weekly level of granularity. The size of each marker shows the total number of cases, and the color indicates the number of weeks that state had a mask mandate. The states with mobility lower than 80 percent of the baseline had a lower number of cases per capita compared to states that had higher mobility. The states with the highest mobility, i.e., South Dakota, North Dakota, Wyoming, and Montana, were also the states with a considerably higher number of cases. These observations indicate that while mask mandates are essential, reducing the mobility of individuals and strict regulations on the businesses open also had a significant association with a reduction in the number of cases.

figure 5

Impact of number of weeks mask mandate was active and the mobility on the number of cases per capita.

The states that did not institute mask mandates also did not impose strict social distancing guidelines, or relaxed the guidelines earlier than most of the other states. These include states like South Dakota, Mississippi, North Dakota, and Utah. Both North Dakota and Utah imposed strict state-wide mask mandates in mid-November when the number of cases increased exponentially. Our results in Table 2 and Fig. 3 show the effect that various mask mandates, social distancing guidelines, and mobility had on the change in the growth rate of the pandemic.

Limitations and future work

We emphasize the limited scope of our analysis, as it is important to interpret these results with a clear understanding of the limitations with respect to both the data quality and the methodology.

Our data includes the start and end dates of various interventions by state and local authorities, but this does not help us measure the actual compliance to these measures. In the case of mask mandates issued at a county level, in a majority of the states, the population under the coverage of the mandates or recommendations is not known. We also did not consider several other conditions that affect growth in cases. For example, the analysis does not consider events such as holidays, weather conditions, congregation events, etc. Our assumptions about the incidence growth phase are based on the best fit from the logistic growth model.

In ARM, the choice of parameters (i.e., support and confidence thresholds) affects the rules generated 46 . If the thresholds are set too high, then we obtain very few rules. If the thresholds are set too low, we obtain too many rules. To make the analysis less susceptible to thresholds, we used the top 5 rules to study the impact of various factors to account for changes in phases of the pandemic. The discretization of variables also affects the type of rules generated. For instance, using just three classes (low, medium, and high) rather than five classes (very low, low, medium, high, and very high) produces a very different set of rules. We used a five-class categorization based on symmetric quantiles to discretize the variables and found it to yield better-quality rules. In the future, a supervised discretization technique based on the strength of association rules can be used to further improve the quality of the rules generated. Future work can explore sensitivity analysis towards this goal. This approach provides a new direction to develop AI-based techniques that can provide policy recommendations for policymakers on various actions that could potentially decrease the number of new cases.

We introduced a novel approach to analyze the effects of different non-pharmaceutical interventions to contain and manage the infection growth rate. The approach uses the association rule mining technique and discretization of infection growth phases, using a multi-peak logistic growth model. We made several interesting observations. For instance, there is a strong similarity between states that had strict mask mandates and reduced infection growth rates. Also, no difference was observed in terms of infection growth rate between state-wide versus county-wide mask mandates. Various other factors such as population density and mobility levels impacted the increase in the number of cases, highlighting the importance of local factors on the number of COVID-19 cases. These findings are important as the United States is trying to reach herd immunity through vaccination, while balancing against a growing resistance towards measures from various state level administrations and an exhausted population.

Code availability

The analysis code for this paper is available on GitHub at https://github.com/raviteja-bhupatiraju/AssociationMining_COVID19 .

Fernandes, N. Economic effects of coronavirus outbreak COVID-19 on the world economy. SSRN 2020. https://doi.org/10.2139/ssrn.3557504.

Li, J. et al. Do stay at home orders and cloth face coverings control COVID-19 in New York City? Results from a SIER model based on real-world data. Open Forum Infectious Diseases 8, (2021). https://doi.org/10.1093/ofid/ofaa442.

Tatapudi, H., Das, R. & Das, T. K. Impact assessment of full and partial stay-at-home orders, face mask usage, and contact tracing: An agent-based simulation study of COVID-19 for an urban region. Global Epidemiol. 2 , 100036. https://doi.org/10.1016/j.gloepi.2020.100036 (2020).

Silva, P. C. et al. COVID-ABS: An agent-based model of COVID-19 epidemic to simulate health and economic effects of social distancing interventions. Chaos Solitons Fractals 139 , 110088. https://doi.org/10.1016/j.chaos.2020.110088 (2020).

Ghaffarzadegan, N. Simulation-based what-if analysis for controlling the spread of COVID-19 in universities. PLoS ONE 16 , e0246323. https://doi.org/10.1371/journal.pone.0246323 (2021).

Badr, H. S. et al. Association between mobility patterns and COVID-19 transmission in the USA: A mathematical modelling study. Lancet Infect. Dis. 20 , 1247–1254. https://doi.org/10.1016/S1473-3099(20)30553-3 (2020).

Sarmadi, M., Marufi, N. & Moghaddam, V. K. Association of COVID-19 global distribution and environmental and demographic factors: An updated three-month study. Environ. Res. 188 , 109748. https://doi.org/10.1016/j.envres.2020.109748 (2020).

Bendavid, E., Oh, C., Bhattacharya, J. & Ioannidis, J. P. Assessing mandatory stay-at-home and business closure effects on the spread of COVID-19. Europ. J. Clin. Invest. 51 , e13484. https://doi.org/10.1111/eci.13484 (2021).

Trivedi, M. & Das, A. Did the timing of state mandated lockdown affect the spread of COVID-19 infection? A county-level ecological study in the United States. J. Prev. Med. Public Health 54 , 238–244. https://doi.org/10.3961/jpmph.21.071 (2021).

Andersen, L. M., Harden, S. R., Sugg, M. M., Runkle, J. D. & Lundquist, T. E. Analyzing the spatial determinants of local COVID-19 transmission in the United States. Sci. Total Environ. 754 , 142396. https://doi.org/10.1016/j.scitotenv.2020.142396 (2021).

Cuadros, D. F., Branscum, A. J., Mukandavire, Z., Miller, F. D. & MacKinnon, N. Dynamics of the COVID-19 epidemic in urban and rural areas in the United States. Ann. Epidemiol. 59 , 16–20. https://doi.org/10.1016/j.annepidem.2021.04.007 (2021).

James, N. & Menzies, M. Covid-19 in the United States: Trajectories and second surge behavior. Chaos Interdisciplin. J. Nonlinear Sci. 30 , 91102. https://doi.org/10.1063/5.0024204 (2020).

Forecasting team I. C. Modeling COVID-19 scenarios for the United States. Nat. Med. https://doi.org/10.1038/s41591-020-1132-9 (2020).

Fischer, C. B. et al. Mask adherence and rate of COVID-19 across the United States. PLoS ONE 16, (2021). https://doi.org/10.1371/journal.pone.0249891

Dasgupta, S. et al. Differences in rapid increases in county-level COVID-19 incidence by implementation of statewide closures and mask mandates - United States, june 1-september 30, 2020. Ann. Epidemiol. 57 , 46–53. https://doi.org/10.1016/j.annepidem.2021.02.006 (2021).

Kwon, S. et al. Association of social distancing and face mask use with risk of COVID-19. Nat. Commun. 12 , 1–10. https://doi.org/10.1038/s41467-021-24115-7 (2021).

Krishnamachari, B. et al. The role of mask mandates, stay at home orders and school closure in curbing the COVID-19 pandemic prior to vaccination. Am. J. Infect. Control 49 , 1036–1042. https://doi.org/10.1016/j.ajic.2021.02.002 (2021).

Lyu, W. & Wehby, G. L. Community use of face masks and COVID-19: Evidence from a natural experiment of state mandates in the us: Study examines impact on COVID-19 growth rates associated with state government mandates requiring face mask use in public. Health Affairs 39 , 1419–1425. https://doi.org/10.1377/hlthaff.2020.00818 (2020).

Guy, G. P. Jr. et al. Association of state-issued mask mandates and allowing on-premises restaurant dining with county-level COVID-19 case and death growth rates-United States, March 1–December 31, 2020. Morbidity Mortality Weekly Rep. 70 , 350. https://doi.org/10.15585/mmwr.mm7010e3 (2021).

Agrawal, R., Imieliński, T. & Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data , pp. 207–216, https://doi.org/10.1145/170036.170072 (1993).

Brossette, S. E. et al. Association rules and data mining in hospital infection control and public health surveillance. J. Am. Med. Inf. Assoc. 5 , 373–381. https://doi.org/10.1136/jamia.1998.0050373 (1998).

Paetz, J. & Brause, R. A frequent patterns tree approach for rule generation with categorical septic shock patient data. In International Symposium on Medical Data Analysis , pp. 207–213, https://doi.org/10.1007/3-540-45497-7_31 (Springer, 2001).

Chen, J., He, H., Williams, G. & Jin, H. Temporal sequence associations for rare events. In Pacific-Asia Conference on Knowledge Discovery and Data Mining , pp. 235–239. https://doi.org/10.1007/978-3-540-24775-3\_30 (Springer, 2004).

Ordonez, C., Ezquerra, N. & Santana, C. A. Constraining and summarizing association rules in medical data. Knowl. Inf. Syst. 9 , 1–2. https://doi.org/10.1007/s10115-005-0226-5 (2006).

Huang, H., Tornero-Velez, R. & Barzyk, T. M. Associations between socio-demographic characteristics and chemical concentrations contributing to cumulative exposures in the United States. J. Exposure Sci. Environ. Epidemiol. 27 , 544–550. https://doi.org/10.1038/jes.2017.15 (2017).

Kadi, N. & Khelfaoui, M. Population density, a factor in the spread of COVID-19 in Algeria: Statistic study. Bull. Natl. Res. Centre 44 , 1–7. https://doi.org/10.1186/s42269-020-00393-x (2020).

Bhadra, A., Mukherjee, A. & Sarkar, K. Impact of population density on COVID-19 infected and mortality rate in India. Model. Earth Syst. Environ. 7 , 623–629. https://doi.org/10.1007/s40808-020-00984-7 (2021).

Feng, S. et al. Rational use of face masks in the COVID-19 pandemic. Lancet Respirat. Med. 8 , 434–436. https://doi.org/10.1016/S2213-2600(20)30134-X (2020).

Sen, S., Karaca-Mandic, P. & Georgiou, A. Association of stay-at-home orders with COVID-19 hospitalizations in 4 states. JAMA 323 , 2522–2524. https://doi.org/10.1001/jama.2020.9176 (2020).

Shimada, K., Hirasawa, K. & Hu, J. Class association rule mining with chi-squared test using genetic network programming. In 2006 IEEE International Conference on Systems, Man and Cybernetics , vol. 6, pp. 5338–5344, https://doi.org/10.1109/ICSMC.2006.385157 (IEEE, 2006).

Alvarez, S. A. Chi-squared computation for association rules: Preliminary results. Boston, MA: Boston College 13 (2003).

Schuchat, A. & CDC COVID-19 Response Team. Public health response to the initiation and spread of pandemic COVID-19 in the United States, February 24–April 21, 2020. https://doi.org/10.15585/mmwr.mm6918e2 (2020).

Markowitz, A. State-by-state guide to face mask requirements. AARP. Retrieved online on February 10 (2021). Available at https://gtxcorp.com/aarp-com-state-by-state-guide-to-face-mask-requirements .

Masks4All. What US states require masks in public? https://masks4all.co/what-states-require-masks/ (2021). (Date of Access: 2021-01-20).

Katragadda, S. Github: Association Mining - Data Collection and Preprocessing (2021). URL https://github.com/raviteja-bhupatiraju/AssociationMining_COVID19 .

Ballotpedia. State government responses to the coronavirus. https://ballotpedia.org/State_government_responses_to_the_coronavirus_(COVID-19)_pandemic,_2020 (2020). (Date of Access: 2020-12-25).

The Food Industry Association. COVID-19 - state reopening plans. https://www.fmi.org/blog/view/state-affairs-issue-papers/2020/12/08/covid-19---state-reopening-plans (2020). (Date of Access: 2020-12-25).

Kang, Y. et al. Multiscale dynamic human mobility flow dataset in the US during the COVID-19 epidemic. Sci. data 7 , 1–13. https://doi.org/10.1038/s41597-020-00734-5 (2020).

Pan, Y. et al. Quantifying human mobility behaviour changes during the COVID-19 outbreak in the United States. Sci. Rep. 10 , 1–9. https://doi.org/10.1038/s41598-020-77751-2 (2020).

Warren, M. S. & Skillman, S. W. Mobility changes in response to COVID-19. arXiv (2020). Preprint available at https://arxiv.org/abs/2003.14228 .

World Population Review. United States by density 2021. https://worldpopulationreview.com/state-rankings/state-densities (2020). (Date of Access: 2020-12-25).

Johns Hopkins University. Coronavirus Resource Center. https://coronavirus.jhu.edu/ (2020). (Date of Access: 2021-01-15).

Batista, M. Estimation of the final size of the second phase of coronavirus epidemic by the logistic model (2020). Preprint at https://doi.org/10.1101/2020.03.11.20024901 .

Wu, K., Darcet, D., Wang, Q. & Sornette, D. Generalized logistic growth modeling of the COVID-19 outbreak: Comparing the dynamics in the 29 provinces in China and in the rest of the world. Nonlinear Dyn. 101 , 1561–1581. https://doi.org/10.1007/s11071-020-05862-6 (2020).

Kokoska, S. & Nevison, C. Critical values for the chi-square distribution. In Statistical Tables and Formulae , pp. 58–59, https://doi.org/10.1007/978-1-4613-9629-1_9 (Springer, 1989).

García, M. N. M., Román, I. R., Peñalvo, F. J. G. & Bonilla, M. T. An association rule mining method for estimating the impact of project management policies on software quality, development time and effort. Exp. Syst. Appl. 34 , 522–529. https://doi.org/10.1016/j.eswa.2006.09.022 (2008).


This research was partially funded by NSF Grants CNS-1650551, CNS-2027688, and CNS-1429526.

Author information

Authors and affiliations.

Informatics Research Institute, University of Louisiana at Lafayette, Lafayette, 70506, USA

Satya Katragadda, Raju Gottumukkala, Ravi Teja Bhupatiraju, Azmyin Md. Kamal, Vijay Raghavan, Henry Chu, Ramesh Kolluru & Ziad Ashkar


Problem description: R.G. and R.K.; Conceptualization: R.G.; Methodology: S.K.; Software: S.K. and A.K.; Validation: S.K. and R.G.; Formal analysis: S.K., V.R., and R.G.; Investigation: R.B. and Z.A.; Resources: R.G.; Data curation: R.B., S.K. and R.G.; Writing–original draft preparation: S.K., R.G., and R.B.; Writing–review and editing: Z.A. and V.R.; Visualization: R.B.; Supervision: R.G.; Project administration: R.G.; Funding acquisition: R.G., V.R., H.C., and R.K. All authors reviewed the manuscript.

Corresponding author

Correspondence to Raju Gottumukkala .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Katragadda, S., Gottumukkala, R., Bhupatiraju, R.T. et al. Association mining based approach to analyze COVID-19 response and case growth in the United States. Sci Rep 11 , 18635 (2021). https://doi.org/10.1038/s41598-021-96912-5

Received : 16 March 2021

Accepted : 18 August 2021

Published : 20 September 2021

DOI : https://doi.org/10.1038/s41598-021-96912-5

Predictive Data Mining Models for Novel Coronavirus (COVID-19) Infected Patients' Recovery


  • 1 Department of Mathematics and Computer Science, Faculty of Science, Federal University of Kashere, P.M.B. 0182, Gombe, Nigeria.
  • 2 Department of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, 9203 Bangladesh.
  • 3 Department of Biological Sciences, Faculty of Science, Federal University of Kashere, P.M.B. 0182, Gombe, Nigeria.
  • PMID: 33063049
  • PMCID: PMC7306186
  • DOI: 10.1007/s42979-020-00216-w

The novel coronavirus (COVID-19 or 2019-nCoV) pandemic has neither a clinically proven vaccine nor clinically proven drugs; however, patients are recovering with the aid of antibiotic medications, anti-viral drugs, and chloroquine, as well as vitamin C supplementation. The world needs a speedy solution to contain the further spread of COVID-19 with the aid of non-clinical approaches such as data mining, augmented intelligence, and other artificial intelligence techniques, so as to mitigate the huge burden on the healthcare system while providing the best possible means for patients' diagnosis and prognosis. In this study, data mining models were developed for the prediction of COVID-19 infected patients' recovery using an epidemiological dataset of COVID-19 patients in South Korea. The decision tree, support vector machine, naive Bayes, logistic regression, random forest, and K-nearest neighbor algorithms were applied directly to the dataset using the Python programming language to develop the models. The model predicted a minimum and maximum number of days for COVID-19 patients to recover from the virus, the age group of patients at high risk of not recovering, those who are likely to recover, and those who are likely to recover quickly. The results show that the model developed with the decision tree algorithm was the most efficient at predicting the possibility of recovery of infected patients, with an overall accuracy of 99.85%, making it the best of the models developed, ahead of those built with support vector machine, naive Bayes, logistic regression, random forest, and K-nearest neighbor.

Keywords: COVID-19; Coronavirus; Data mining; Decision tree; Pandemic; Patients’ recovery.
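The abstract above compares classifiers by overall accuracy on a held-out set. As a rough illustration of that evaluation step only, the following minimal sketch hand-writes a k-nearest-neighbor classifier (one of the six algorithm families named in the abstract) and scores it against a majority-class baseline. Everything in it is an assumption made for illustration: the synthetic records, the two features (`age`, `days`), the labels `released` and `isolated`, the labelling rule, and all function names are hypothetical and are not taken from the study's dataset or code.

```python
import math
import random
from collections import Counter


def make_toy_patients(n, seed=0):
    """Generate synthetic ((age, days), label) records.

    The labelling rule below is invented purely for illustration;
    it does not reflect the South Korean dataset used in the study.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.uniform(0, 90)
        days = rng.uniform(0, 40)
        label = "released" if (age < 60 and days > 14) else "isolated"
        records.append(((age, days), label))
    return records


def knn_predict(train, point, k=3):
    """Majority vote among the k training records closest to `point`."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def majority_baseline(train):
    """Most frequent training label: the reference any model should beat."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


# Hold out the last 100 of 400 synthetic records for evaluation.
records = make_toy_patients(400, seed=42)
train, test = records[:300], records[300:]
truth = [label for _, label in test]

knn_acc = accuracy([knn_predict(train, x) for x, _ in test], truth)
base_label = majority_baseline(train)
base_acc = accuracy([base_label] * len(test), truth)
print(f"k-NN accuracy: {knn_acc:.3f}, majority baseline: {base_acc:.3f}")
```

Reporting the baseline next to the model's score is a design choice of this sketch: overall accuracy on its own can look high when one class dominates, so the gap over the baseline is often the more informative number.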

© Springer Nature Singapore Pte Ltd 2020.

Conflict of interest statement

The authors have declared that no conflict of interest exists.

Figures (captions only; images not preserved)

  • Frequency of sex attribute
  • Frequency of age attribute
  • Frequency of infection_case attribute
  • Frequency of no_days attribute
  • Frequency of state attribute
  • Decision Tree model for COVID-19 infected patients' recovery
  • Performance evaluation results of the models

References

  • Al-Turaiki I, Alshahrani M, Almutairi T. Building predictive models for MERS-CoV infections using data mining techniques. J Infect Public Health. 2016;9:744–748. doi: 10.1016/j.jiph.2016.09.007.
  • Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–185.
  • Coronavirus dataset of Korea Centers for Disease Control & Prevention (KCDC). https://www.kaggle.com/kimjihoo/coronavirusdataset/data . Accessed 20 Apr 2020.
  • Everitt BS, et al. Miscellaneous clustering methods in cluster analysis. 5. Chichester: Wiley; 2011.
  • Gandhi R. Naive Bayes classifier, towards data science. 2018. https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c . Accessed 25 Apr 2020.


  1. Predicting the incidence of COVID-19 using data mining

    In this study, we aim to predict the incidence of COVID-19 within a two-week period to better manage the disease. The COVID-19 datasets provided by Johns Hopkins University contain information on COVID-19 cases in different geographic regions since January 22, 2020 and are updated daily.

  2. Predicting the incidence of COVID-19 using data mining - PMC

    The high prevalence of COVID-19 has made it a new pandemic. Predicting both its prevalence and incidence throughout the world is crucial to help health professionals make key decisions. In this study, we aim to predict the incidence of COVID-19 within a two-week period to better manage the disease.

  3. Insights from the COVID-19 Pandemic: A Survey of Data Mining ...

    Overall, the COVID-19 pandemic has highlighted the importance of data mining techniques in analyzing large volumes of data in real-time, integrating data from multiple sources, developing predictive models, ensuring data quality, and considering ethical considerations.

  4. Data based model for predicting COVID-19 morbidity and ...

    We aimed to predict the evolution of COVID-19 in metropolises and identify air quality and meteorological variables correlated with confirmed cases and deaths.

  5. Research and Data Mining During the COVID-19 Pandemic

    Data mining efforts aim to formulate, analyze and implement basic induction processes that facilitate the extraction of meaningful information and knowledge from unstructured data. Data mining extracts patterns, changes, associations and anomalies from large data sets.

  6. Data mining and analysis of scientific research data records ...

    Results. From historical analysis of scientific data records on viruses, pandemics and mortality, we identified that Chinese universities have not been leading on these topics historically. However, during the early stages of the Covid-19 pandemic, the Chinese universities are strongly dominating the research on these topics.

  7. Association mining based approach to analyze COVID-19 ...

    This paper introduces a novel data mining-based approach to understand the effects of different non-pharmaceutical interventions in containing the COVID-19 infection rate.

  8. Predictive Data Mining Models for Novel Coronavirus (COVID-19 ...

    In this study, data mining models were developed for the prediction of COVID-19 infected patients' recovery using epidemiological dataset of COVID-19 patients of South Korea. The decision tree, support vector machine, naive Bayes, logistic regression, random forest, and K-nearest neighbor algorithms were applied directly on the dataset using ...

  9. High-Performance Mining of COVID-19 Open Research Datasets ...

    The COVID-19 global pandemic is an unprecedented health crisis. Many researchers around the world have produced an extensive collection of literature since the.

  10. Data science approaches to confronting the COVID-19 pandemic ...

    In this paper, we review the newly born data science approaches to confronting COVID-19, including the estimation of epidemiological parameters, digital contact tracing, diagnosis, policy-making, resource allocation, risk assessment, mental health surveillance, social media analytics, drug repurposing and drug development.