Investigating COVID-19's global evolution factors

COVID-19 has already affected almost every country in the world, exposing national governments to an unprecedented situation and causing health and economic shocks on a scale the world has not seen since World War II. The intensity of the pandemic varies between countries, governments have adopted different preventive policies, and populations in different regions have not responded consistently. PREDICTA's scientific team took on the challenge of collecting and analyzing the relevant publicly available data, aiming to identify, through extensive statistical analysis, the factors that most significantly affect the pandemic's evolution and spread. The findings of this project are considered worthy of publicity and attention, as they may contribute to wider global research, help to better understand the situation, and help optimize preventive policies in the present and in possible future pandemics.


The project reveals significant findings:


Men and women are equally likely to become infected.


Men have a roughly 40% higher risk of dying.

Frequent flight connections between China, the USA and Western Europe were the original cause of the virus's wide spread.

A high share of elderly people in the population and a larger number of nursing homes led to increased mortality.

Early imposition of restriction measures reduced citizens' mobility and thereby drastically limited the spread of the virus.

Once lockdown measures were imposed, ~22 days elapsed until the reproduction rate fell below 1 in Western Europe and ~38 days in Latin America.

Higher temperatures reduce infection, as they encourage more outdoor activity.

Poverty and informal employment are significant spread accelerators in developing countries.

Indoor recreation venues are considered superspreading settings, while strolling in nature is considered extremely safe.

Lengthy indoor exposure increases the risk of infection, while mask usage and ventilation limit the spread of the virus.


The Project's Methodological Approach

A - Data Collection and Retrieval

One challenging task was identifying the most appropriate and informative data sources, the type of data required, and the extent of their availability for each country. Data were collected from official health organizations such as the WHO and ECDC, UK and US universities, non-profit foundations such as Wikipedia, and institutions such as the World Bank.

Data collected were grouped in 5 high-level categories:

  • Virus: Patients and Deceased Counts, Demographics, Covid-19 Tests, Government Preventive Measures and Citizens' Response.
  • Health System: Health centers' capabilities and vaccination policies.
  • Hazards: Death causes and risks.
  • General: Climate and geographical data.
  • Socioeconomics: Prosperity, human rights and transportation data.

B - Data Management

A Master COVID-19 dataset was constructed, and an automated process was developed for its continuous enrichment. Data quality checks were performed before proceeding with the analysis. A large number of KPIs were calculated to enrich the originally available information. For example, KPIs were derived through groupings, ratios, rolling averages, percentages, etc., at both the day-to-day level and the overall country level.
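As an illustration of the kind of derived KPI described above, the following is a minimal sketch of a trailing rolling average and a simple ratio KPI. The function name `rolling_mean`, the 7-day window, and the sample counts are assumptions made for this example only; they are not taken from the project's actual pipeline.

```python
from collections import deque


def rolling_mean(values, window=7):
    """Trailing rolling mean; averages fewer points until the window fills."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out


# Hypothetical daily new-case counts for one country (illustrative only).
daily_cases = [10, 12, 15, 20, 18, 25, 30, 28, 35, 40]

# Smoothed series: dampens day-of-week reporting noise.
smoothed = rolling_mean(daily_cases, window=7)

# Ratio KPI: day-over-day growth factor of the smoothed series.
growth = [today / yesterday for yesterday, today in zip(smoothed, smoothed[1:])]
```

A growth factor above 1 indicates that the smoothed daily count is still rising; the same pattern could be applied to deaths or tests.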

The Master File's information was standardized (by population and/or age), a statistical necessity for meaningful comparisons. Comparisons were, of course, possible only where reliable data were available.
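Standardization by population can be sketched as a rate per 100,000 inhabitants. The helper `per_100k`, the country names, and all figures below are hypothetical and serve only to show why raw counts are not directly comparable; age standardization is not covered by this sketch.

```python
def per_100k(count, population):
    """Express a raw count as a rate per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical figures (illustrative only, not real data).
countries = {
    "Country A": {"deaths": 500, "population": 10_000_000},
    "Country B": {"deaths": 300, "population": 2_000_000},
}

# Rate per 100,000 inhabitants for each country.
rates = {
    name: per_100k(data["deaths"], data["population"])
    for name, data in countries.items()
}
```

In this made-up example, Country A has more deaths in absolute terms, but Country B has the higher rate once population is taken into account.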

C - Data Exploration

The virus's evolution (number of cases and deaths) was examined with respect to two groups of factors. The first group contains socioeconomic and structural country characteristics such as GDP, quality of the health system, climate and population density. The second group contains government intervention measures such as lockdown policies and testing strategy. Factors found to be significant for the virus's evolution were further examined using advanced statistical methods.

D - Statistical Analysis

Statistical tests and correlations were performed between several KPIs related to virus growth and mortality. Factor Analysis was used for data reduction and for creating indices related to significant factors such as mobility variations and other preventive policy measures. Anomaly Detection was used to identify uncommon patterns that may indicate unreliable data. Cluster Analysis was used to examine the potential grouping of countries based on the pandemic's evolution and the other independent factors found to significantly influence it. Decision trees were also used to help clarify complex relations left unexplained by the clustering procedure.
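Two of the simpler steps mentioned above, correlation between KPIs and the flagging of unusual values, can be sketched with the standard library alone. This is a simplified stand-in under stated assumptions, not the project's actual method: the source does not say which correlation coefficient or anomaly detection technique was used, and the function names and the z-score threshold below are hypothetical.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def zscore_outliers(values, threshold=2.0):
    """Indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


# Hypothetical KPI series (illustrative only, not real data).
share_over_65 = [10.0, 14.0, 18.0, 21.0, 23.0]
deaths_per_100k = [4.0, 7.0, 9.0, 13.0, 15.0]
r = pearson(share_over_65, deaths_per_100k)

# A reported daily count far from the rest is flagged for manual review.
suspect_days = zscore_outliers([10, 11, 9, 10, 12, 10, 11, 200], threshold=2.0)
```

A flagged index is only a prompt for inspection, for example a backlog of cases reported on a single day, and not proof that the value is wrong.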

E - Simulation Analysis

The risk of infection under specific situations and conditions was simulated using what-if scenarios. Epidemiological models were applied and compared to depict and predict the pandemic's evolution by country. What-if simulation modeling was used to understand the pandemic's evolution under alternative scenarios defined by governments' preventive policies.
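A what-if comparison of this kind can be sketched with a basic discrete-time SIR (susceptible, infected, recovered) model. The source does not state which epidemiological models were applied, so the SIR model, the function `simulate_sir`, and every parameter value below are assumptions chosen purely for illustration.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Discrete-time SIR model with one-day steps.

    beta: transmission rate per day (a lower value stands in for stricter measures)
    gamma: recovery rate per day
    Returns a list of (susceptible, infected, recovered) tuples, one per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Two hypothetical scenarios that differ only in the transmission rate.
baseline = simulate_sir(1_000_000, 10, beta=0.30, gamma=0.10, days=180)
restricted = simulate_sir(1_000_000, 10, beta=0.15, gamma=0.10, days=180)

# Peak number of simultaneously infected people in each scenario.
peak_baseline = max(i for _, i, _ in baseline)
peak_restricted = max(i for _, i, _ in restricted)
```

Comparing `peak_baseline` with `peak_restricted` shows how a lower assumed transmission rate flattens the peak within the simulated horizon. A realistic scenario analysis would need country-specific parameters fitted to data.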

Data Category & Relevant Sources

To create the Master COVID-19 file, 23 different data sources were ultimately used after extensive research and evaluation. These sources include official health organizations such as the WHO and ECDC, universities in the UK and the US, non-profit foundations such as Wikipedia, and institutions such as the World Bank. Data for 213 countries were included in the Master File, and a total of over 1,000 KPIs were found or produced.

Data Category: Content

  • Virus: Infections, Tests, Demographics, Government Measures, Population Mobility etc.
  • Health System: Beds, ICUs, Hospitals, Physicians, Vaccines etc.
  • Hazards: Health Risk Factors, Causes of Death, Perceived Health Status, Life Expectancy etc.
  • General: Population Density, Age, Climate, Continent etc.
  • Socioeconomics: Flights, Prosperity Indexes, Enterprises, Household Size etc.

Relevant Sources:

  • European Centre for Disease Prevention and Control (ECDC)
  • Organisation for Economic Co-operation & Development (OECD)
  • United Nations (OCHA)
  • United Nations (UNDP)
  • United Nations (DESA)
  • World Health Organization (WHO)
  • World Bank Group (WBG)
  • Central Intelligence Agency (CIA)
  • University of Oxford
  • Blavatnik School of Government
  • University of Washington (GHDx)
  • Legatum Institute
  • ACAPS (Non-Profit)
  • OWID (Non-Profit)
  • Institute for Health Metrics and Evaluation (IHME)
  • International Labour Organization (ILO)
  • Worldometer (Non-Profit)
  • City Population (Non-Profit)
  • Wikipedia (Non-Profit)
  • Google LLC
  • National Health Ministries

Data Issues

Infection Data Recording

Data recording is asynchronous (data reported on a specific date refer to earlier dates), sources follow no common reporting strategy, and cases and deaths are recorded inconsistently (in-hospital vs. out-of-hospital recording).

Testing Strategy

Testing strategy varies by country and changes constantly, leading to under- or overestimation of incidence and mortality. Rapid COVID-19 tests have limited accuracy, which leads to underestimation of real cases. Very few countries trace and test the contacts of infected cases. Some countries report the total number of tests performed, while others report the number of individuals tested; the two figures differ because the same person is often tested more than once.

Reliability / Data Availability

Many countries do not publish detailed information on certain groups of data (e.g. demographics, hospitalizations). Developing countries provide limited information and report figures of questionable reliability.

Geographical / Population / Cultural Characteristics

Regional differences within a country were not taken into account, as the relevant data are not easily accessible. Each country's level of exposure to the global community was also difficult to take into account.