This repository contains code and data necessary to replicate the findings of our paper [INSERT arXiv CITATION].
Scripts in this repository are written in R, Python, and Stata. Note that you will need a Stata license to fully replicate the analysis. Throughout this README, it is assumed that you will execute scripts from the repo root directory. In addition, we assume that you have a conda environment containing the Python and R packages described in environment.yml.
To create and activate this environment using conda, execute the lines below. Note: RStudio is commented out of the environment because it causes dependency clashes on Windows. If you are not on Windows and would like to use the RStudio app, feel free to uncomment it before creating the environment.
conda env create -f environment.yml
conda activate gpl-covid
Once you have activated this environment, to run some of the Python scripts, you’ll need to install the small package (1 module) that is included in this repo. Ensure you are currently located inside the repo root directory (cd gpl-covid), then execute:
pip install -e .
To run one of the scripts, you will also need an API key for the US Census API, which can be obtained here. You will need to save this key to api_keys.json in the root directory of this repo with the following format:
{
"census": "API_KEY_STRING"
}
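As a minimal sketch (not code taken from this repository), a script could read the key from this file as shown below. The helper name `load_census_key` is hypothetical; only the file name api_keys.json and the "census" field come from the format above, and the default path assumes the script is run from the repo root.

```python
import json
from pathlib import Path


def load_census_key(path="api_keys.json"):
    """Return the Census API key stored under the "census" field.

    Hypothetical helper for illustration; the default path assumes the
    current working directory is the repo root.
    """
    key_file = Path(path)
    if not key_file.exists():
        raise FileNotFoundError(f"{key_file} not found; create it in the repo root")
    with key_file.open() as f:
        keys = json.load(f)
    if "census" not in keys:
        raise KeyError('api_keys.json must contain a "census" entry')
    return keys["census"]
```

Failing early with a clear message is a design choice here: a missing or misnamed key is easier to diagnose before any request is sent to the Census API.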
Finally, to estimate the regression models, you will need several packages installed in Stata. To add them, launch Stata and run:
ssc install reghdfe, replace
* the latest version of reghdfe also requires ftools
ssc install ftools, replace
ssc install coefplot, replace
ssc install filelist, replace
codes
├── data
│ ├── china
│ │ ├── collate_data.py
│ │ └── download_and_clean_JHU_china.R
│ ├── cutoff_dates.csv
│ ├── france
│ │ ├── download_and_clean_JHU_france.R
│ │ ├── format_infected.do
│ │ └── format_policy.do
│ ├── iran
│ │ ├── download_and_clean_JHU_iran.R
│ │ ├── iran-split-interim-into-processed.py
│ │ └── iran_cleaning.R
│ ├── italy
│ │ ├── download_and_clean_JHU_italy.R
│ │ └── italy-download-cases-merge-policies.py
│ ├── korea
│ │ ├── download_and_clean_JHU_korea.R
│ │ ├── generate_KOR_processed.R
│ │ └── make_JHU_comparison_data.R
│ ├── multi_country
│ │ ├── download_6_countries_JHU.R
│ │ ├── get_JHU_country_data.R
│ │ ├── get_adm_info.py
│ │ └── quality-check-processed-datasets.py
│ └── usa
│   ├── add_testing_regimes_to_covidtrackingdotcom_data.ipynb
│   ├── check_health_data.R
│   ├── download_and_clean_JHU_usa.R
│   ├── download_and_clean_usafacts.R
│   ├── download_latest_covidtrackingdotcom_data.py
│   ├── get_usafacts_data.R
│   └── merge_policy_and_cases.py
├── models
│ ├── CHN_create_CBs.R
│ ├── CHN_generate_data_and_model_projection.R
│ ├── FRA_create_CBs.R
│ ├── FRA_generate_data_and_model_projection.R
│ ├── IRN_create_CBs.R
│ ├── IRN_generate_data_and_model_projection.R
│ ├── ITA_create_CBs.R
│ ├── ITA_generate_data_and_model_projection.R
│ ├── KOR_create_CBs.R
│ ├── KOR_generate_data_and_model_projection.R
│ ├── USA_create_CBs.R
│ ├── USA_generate_data_and_model_projection.R
│ ├── alt_growth_rates
│ │ ├── CHN_adm2.do
│ │ ├── FRA_adm1.do
│ │ ├── IRN_adm1.do
│ │ ├── ITA_adm2.do
│ │ ├── KOR_adm1.do
│ │ ├── MASTER_run_all_reg.do
│ │ └── USA_adm1.do
│ ├── get_gamma.py
│ ├── predict_felm.R
│ ├── projection_helper_functions.R
│ └── run_all_CB_simulations.R
├── plotting
│ ├── count-policies.py
│ ├── examine_lagged_relationship_between_new_deaths_recoveries_and_older_cases.R
│ ├── fig1.R
│ ├── fig2.R
│ ├── fig4_analysis.py
│ ├── figA2.py
│ └── gen_fig4.py
├── pop.py
└── utils.py
A detailed description of the epidemiological and policy data obtained and processed for this analysis can be found here. This is a live document that may be updated as additional data becomes available. For a version that is fixed at the time this manuscript was submitted, please see the link to our paper at the top of this README.
There are four stages to our analysis:
- Data collection and processing
- Regression model estimation
- SIR model projections
- Figure creation
The steps to obtain all data in <data/raw>, and then process this data into datasets that can be ingested into a regression, are described below. Note that some of the data collection was performed through manual downloading and/or processing of datasets and is described in as much detail as possible. The sections should be run in the order listed, as some files from later sections will depend on those from earlier sections (e.g. the geographical and population data).
For detailed information on the manual collection of policy, epidemiological, and population information, see the up-to-date version of our paper’s Appendix. A version that was frozen at the time of submission is available with the article cited at the top of this README. Our epidemiological and policy data sources for all countries are listed here, with a more frequently updated version here.
- python codes/data/multi_country/get_adm_info.py: Generates shapefiles and csvs with administrative unit names, geographies, and populations (most countries). Note: To run this script, you will need a U.S. Census API key. See Setup.
- For Chinese city-level population data, the dataset is extracted from a compiled dataset of the 2010 Chinese City Statistical Yearbooks. We manually matched the city-level population dataset to the city-level COVID-19 epidemiology dataset. The resulting file is in data/raw/china/china_city_pop.csv.
- For Korean population data, download from Statistics Korea (a similar page in English is available here).
  a. Click the ITEM tab and check the Population box only.
  b. Click the By Administrative District tab and check 1 Level Select all.
  c. Click the By Age Group tab and check the Total box only.
  d. Click the Time tab and check Monthly and 2020.02.
  e. Click the green Search button on the upper right of the window.
  f. Click the blue Download button under the Search button.
  g. Select CSV as File format and download.
  h. Open this file, remove the top two rows and the second column. Then change the header (the top row) to adm1_name, population. Save to data/interim/korea/KOR_population.csv.
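The manual edit in the last step could also be scripted. The sketch below is only an illustration of that step, not part of the repository: the function name `clean_kor_population` is hypothetical, and it assumes the downloaded file is UTF-8 encoded and that exactly two columns remain once the second column is removed.

```python
import csv


def clean_kor_population(src_path, dst_path):
    """Apply the manual cleaning step to the downloaded Statistics Korea CSV.

    Assumption: the file is UTF-8 encoded; change the encoding if it is not.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))

    # Remove the top two rows, then drop the second column of every row.
    remaining = [row[:1] + row[2:] for row in rows[2:]]

    # The new top row is the header; replace it with the expected names.
    remaining[0] = ["adm1_name", "population"]

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(remaining)
```

Calling `clean_kor_population("downloaded.csv", "data/interim/korea/KOR_population.csv")` would write the cleaned file, where "downloaded.csv" stands in for whatever name the download was saved under.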
Most policy and testing data was manually collected from a variety of sources. A mapping was developed from each policy to one of the variables we encode for our regression. These sources and mappings are listed in a csv for each country following the pattern data/raw/[country_name]/[country_name]_policy_data_sources.csv.
Any policy/testing data that was scraped programmatically is formatted similarly to the manual data sheet and saved to data/interim/[country_name]/[country_name]_policy_data_sources_other.csv. These programmatic steps are listed below:
- python codes/data/usa/download_latest_covidtrackingdotcom_data.py: Downloads testing regime data. Note: This site has been getting high traffic and frequently fails to process requests. If this script throws an error due to that issue, try again later.
- jupyter nbconvert --ExecutePreprocessor.timeout=None --ExecutePreprocessor.kernel_name=python3 --execute codes/data/usa/add_testing_regimes_to_covidtrackingdotcom_data.ipynb: Check that detected testing regime changes make sense and discard any false detections (it is in a notebook so you should manually check the detected changes, but you may run it directly using our choices with the above command).
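Because the download step can fail transiently when the site is under load, one option is to wrap the call in a simple retry loop. The sketch below is not part of the repository; `run_with_retries`, its parameters, and the default wait time are illustrative assumptions.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, wait_seconds=60,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd, pausing and retrying whenever it exits with a nonzero status.

    Returns the number of the attempt that succeeded. The runner and sleep
    arguments exist only so the function can be exercised without real calls.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

For example, `run_with_retries(["python", "codes/data/usa/download_latest_covidtrackingdotcom_data.py"])` would re-run the download up to three times, waiting a minute between attempts.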
Rscript codes/data/multi_country/download_6_countries_JHU.R: Downloads data for 6 countries from the Johns Hopkins University (JHU) dataset underlying their dashboard. Note: The JHU dataset format has been changing frequently, so it is possible that this script will need to be modified.
- For data from January 24, 2020 onwards, we relied on an open source GitHub project. Download the data and save it to data/raw/china/DXYArea.csv.
- For data before January 24, 2020, we collected data manually; the file is in data/raw/china/china_city_health_jan.xlsx.
- Download the March 12 file update for the number of confirmed cases per région from the French government’s website and save it to data/raw/france/fr-sars-cov-2-20200312.xlsx. This file only gets updated every 1-5 days, so we augment it with data scraped daily from a live website through March 25, 2020. At this point, the live website stopped reporting daily infections, and we're currently working to figure out if this periodically updated site will continue to produce updates.
stata -b do codes/data/france/format_infected.do: Run in Stata to clean and format the French regional epidemiological dataset. Set the last sample date at the beginning of the script; the default is March 18th.
- Copy all of the date lines from the "New COVID-19 cases in Iran by province" table on the Wikipedia page tracking this outbreak in Iran.
- Open the Excel file data/raw/iran/covid_iran.xlsx. This file contains the first-step cleaning template for the cases data, as well as the information on key policy actions taken by the Iranian government. The tabs in this file are:
  a. 200314_cases_raw: A template into which raw data from the Wikipedia table (see the previous step) should be pasted.
  b. cases_cleaned--to_csv: A cleaned column format for the intermediate cases data. Simply extend the formulas (by copy/paste) in each row so that each row of the raw data is included. Do not change the column headings. Once all raw data has been included, save this tab as a csv file at data/interim/iran/covid_iran_cases.csv.
  c. 200314_policies--to_csv: A list of the key policies Iran implemented to combat the coronavirus, and sources. A copy of the information in this tab is saved as data/interim/iran/covid_iran_policies.csv. To update data with future policy changes, update this tab with the relevant information and replace the csv with a new copy of this tab.
Epi data is downloaded and merged with policy data in one step, described in the following section.
- Korean epi data were manually collected from various Korean provincial websites. Note that these provinces often report the data in different formats (e.g. pdf attachments, interactive dashboards) and usually do not have English translations. For more details on how we collected the data, please refer to the Data Acquisition and Processing section in the appendix. This data is saved in data/interim/korea/KOR_health.csv.
Run the following scripts to merge epi, policy, testing, and population data for each country. After completion, you may run codes/data/multi_country/quality-check-processed-datasets.py to make sure all of the fully processed datasets are correctly and consistently formatted.
python codes/data/china/collate_data.py
stata -b do codes/data/france/format_policy.do
Rscript codes/data/iran/iran_cleaning.R
python codes/data/iran/iran-split-interim-into-processed.py
python codes/data/italy/italy-download-cases-merge-policies.py
Rscript codes/data/korea/generate_KOR_processed.R
python codes/data/usa/merge_policy_and_cases.py: Merge all US data. This outputs data/processed/adm1/USA_processed.csv.
  - (optional) Rscript codes/data/usa/check_health_data.R: Confirm that known data quality issues have been dealt with.
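If you prefer to run the merge scripts in one go, a small driver along the following lines could chain them. This is a sketch under stated assumptions rather than a script shipped with the repo: `run_all` and `MERGE_COMMANDS` are hypothetical names, the commands are copied from the list above, and it assumes that `python`, `Rscript`, and `stata` are all on your PATH and that you start from the repo root.

```python
import subprocess

# Merge commands copied from the list above, in the order listed.
MERGE_COMMANDS = [
    ["python", "codes/data/china/collate_data.py"],
    ["stata", "-b", "do", "codes/data/france/format_policy.do"],
    ["Rscript", "codes/data/iran/iran_cleaning.R"],
    ["python", "codes/data/iran/iran-split-interim-into-processed.py"],
    ["python", "codes/data/italy/italy-download-cases-merge-policies.py"],
    ["Rscript", "codes/data/korea/generate_KOR_processed.R"],
    ["python", "codes/data/usa/merge_policy_and_cases.py"],
]


def run_all(commands=MERGE_COMMANDS, dry_run=False):
    """Run each command in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    executed = []
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a nonzero exit status.
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed
```

Calling `run_all(dry_run=True)` first is a cheap way to confirm the order before launching the full run with `run_all()`.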
Once data is obtained and processed, you can estimate regression models for each country using the following command:
stata -b do codes/models/alt_growth_rates/MASTER_run_all_reg.do
Each individual country regression can also be run separately using the scripts in codes/models/alt_growth_rates.
Once the regression coefficients have been estimated in the above models, run the following code to generate projections of active and cumulative infections using an SIR model:
- python codes/models/get_gamma.py: Estimates removal rate to use in projections from data that contains both cumulative cases and active cases.
- Rscript codes/models/run_all_CB_simulations.R: Generates the csv inputs for Figure 4.
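To give a feel for what these two steps involve, here is a simplified sketch. It is not the implementation in get_gamma.py or in the R simulation scripts. It assumes that removed cases equal cumulative cases minus active cases, estimates the removal rate as the average daily increase in removed cases per active case, and steps a basic discrete-time SIR model forward. The function names, the transmission parameter `beta`, and all inputs are hypothetical.

```python
def estimate_gamma(cumulative, active):
    """Average daily removals per active case.

    Assumption: removed = cumulative - active (recoveries plus deaths).
    """
    removed = [c - a for c, a in zip(cumulative, active)]
    rates = [
        (removed[t + 1] - removed[t]) / active[t]
        for t in range(len(removed) - 1)
        if active[t] > 0
    ]
    if not rates:
        raise ValueError("need at least two days with active cases")
    return sum(rates) / len(rates)


def project_sir(s0, i0, r0, beta, gamma, days):
    """Step a discrete-time SIR model forward one day at a time.

    Returns two lists: active infections (I) and cumulative infections (I + R).
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    active, cumulative = [i], [i + r]
    for _ in range(days):
        new_infections = beta * s * i / n
        new_removals = gamma * i
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        active.append(i)
        cumulative.append(i + r)
    return active, cumulative
```

In this simplified setup, a policy effect would enter as a change in `beta`, while `gamma` stays fixed at the value estimated from the data.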
To generate the four figures in the paper, run the following scripts. Figure 1 only requires the data collection steps to be complete. Figures 2 and 3 require the regression step to be complete, and Figure 4 requires the projection step to be complete.
Rscript codes/plotting/fig1.R: Generates 12 outputs that constitute Figure 1 (*_timeseries.pdf and *_map.pdf for each of the 6 countries). Note: This script requires data/raw/china/match_china_city_name_w_adm2.csv, a manually generated crosswalk of Chinese city names.
Rscript codes/plotting/fig2.R: Generates 9 outputs that constitute Figure 2, in results/figures/fig2:
- A1: Fig2_nopolicy.pdf: Main figure for Panel A
- A2: Fig2_effectsize_nopolicy.pdf: Effect size values for Panel A
- A3: Fig2_growth_nopolicy.pdf: Effect size as percent growth values for Panel A
- B1: Fig2_comb.pdf: Main figure for Panel B
- B2: Fig2_effectsize_comb.pdf: Effect size values for Panel B
- B3: Fig2_growth_comb.pdf: Effect size as percent growth values for Panel B
- C1: Fig2_ind.pdf: Main figure for Panel C
- C2: Fig2_effectsize_ind.pdf: Effect size values for Panel C
- C3: Fig2_growth_ind.pdf: Effect size as percent growth values for Panel C
Figure 3 is generated by the regression estimation step (codes/models/alt_growth_rates/MASTER_run_all_reg.do).
Note that the outputs of codes/plotting/fig1.R are required for Fig 4 as well.
- (if not already generated) Rscript codes/plotting/fig1.R: Generate the cases data.
- python codes/plotting/gen_fig4.py: Generate Figure 4.
- python codes/plotting/fig4_analysis.py: Generate a printout of numerical results from the projections for each country.
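The stage dependencies described for Figures 1 through 4 can be summarized in a small lookup, which may help when only part of the pipeline has been run. This is an illustrative sketch only: the stage labels and the `available_figures` helper are hypothetical and do not exist in the repo.

```python
# Pipeline stages in the order they must be completed (labels are hypothetical).
STAGES = ["data", "regression", "projection"]

# Last stage each figure depends on, as described in this README.
# Figure 4 additionally needs the outputs of codes/plotting/fig1.R.
REQUIRED_STAGE = {
    "fig1": "data",
    "fig2": "regression",
    "fig3": "regression",
    "fig4": "projection",
}


def available_figures(last_completed_stage):
    """Return the figures that can be generated once a given stage is done."""
    done = STAGES[: STAGES.index(last_completed_stage) + 1]
    return sorted(fig for fig, stage in REQUIRED_STAGE.items() if stage in done)
```

For instance, `available_figures("regression")` lists Figures 1, 2, and 3 but not Figure 4, which still needs the projection step.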
python codes/plotting/count-policies.py
Figure A1 is generated by the regression estimation step (codes/models/alt_growth_rates/MASTER_run_all_reg.do). The final output file is figures/appendix/ALL_conf_cases_e.png. Please note that if you're running the Stata console in Unix, .png file formats are not supported and you would need to change the final format in line 50 of codes/models/alt_growth_rates/MASTER_run_all_reg.do from .png to either .eps or .ps. For more information on supported file formats while using the graph export command on different operating systems, please click here.
- Rscript codes/data/korea/make_JHU_comparison_data.R: Creates data/interim/korea/KOR_JHU_data_comparison.csv
- python codes/plotting/figA2.py: Generates 2 outputs that constitute Figure A2 (results/figures/appendix/figA2-1.pdf and results/figures/appendix/figA2-2.pdf)