RESULT: Data repository of public and Open data on COVID

Authors: Alexandra Freeman
Date added: 13th April 2020, 17:31:19

Full text

A static copy of the dataset has been uploaded to figshare; it contains a fixed version of the data record at the time of submission, covering 1st December 2019 to 5th February 2020. A live version of the data record, which will be continually updated, can be downloaded from GitHub (https://github.com/beoutbreakprepared/nCoV2019) or directly from Google Drive (https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGu-JXUJWET008/edit#gid=0) in CSV format, which can be imported into a variety of software programs. We have also established a GitHub repository at https://github.com/beoutbreakprepared/nCoV2019/covid19 and provide code for importing the data into R statistical software. Because the epidemiological situation regarding the COVID-19 outbreak is continuously evolving, we have made available an archive data folder in our GitHub repository, where new data are uploaded. Each row represents a single individual case, identified by an ID. A description of the fields in the database is given below and is also available through a data dictionary on GitHub (https://github.com/beoutbreakprepared/nCoV2019/covid19):

ID - Unique identifier for reported case. Currently ID is run concurrently for cases reported from Hubei, China and cases reported outside of Hubei, China. ID order does not necessarily reflect epidemiological progression, or reporting date, and should not be used to order cases in temporal progression.

age - Age of the case reported in years. When not reported, N/A is used. Age ranges are recorded as start_age-end_age, e.g. 50–59.

sex - Sex of the case. When not reported, N/A is used.

city – Initial generic geographic metadata is reported here. Subsequently standardized via lookup with geographic reference table.

province – Initial entry of name of the first administrative division in which the case is reported. Subsequently standardized via lookup with geographic reference table.

country - Name of the country in which the case is reported. Note that imported cases are assigned to the country in which confirmation occurred; this is typically the arrival country rather than the site of infection. “travel_history_location” describes other locations of travel in such instances.

wuhan(0)_not_wuhan(1) - Binary flag to distinguish cases from Wuhan, Hubei, China, from all other cases. 0 denotes a case is reported in Wuhan, 1 denotes a case reported elsewhere in the world.

latitude - The latitude of the specific location (denoted as “point” in “geo_resolution”) where the case was reported, or the latitude of a representative location (denoted as “admin” in “geo_resolution”) within the administrative unit in which the case is reported.

longitude - The longitude of the specific location (denoted as “point” in “geo_resolution”) where the case was reported, or the longitude of a representative location (denoted as “admin” in “geo_resolution”) within the administrative unit in which the case is reported.

geo_resolution - An indicative field describing the spatial representativeness of “latitude” and “longitude”. “point” indicates that the coordinates represent a specific location. “admin” denotes that the coordinates are representative of the administrative unit in which they lie. The subsequent “admin3”, “admin2”, “admin1” fields and the corresponding “admin_id” and shapefile allow a more specific representation.

date_onset_symptoms - Date when the reported case was recorded to have become symptomatic. Specific dates are reported as DD.MM.YYYY. Ranges are recorded as DD.MM.YYYY - DD.MM.YYYY. Ranges with uncertain start or finish dates are recorded as - DD.MM.YYYY and DD.MM.YYYY - respectively.

date_admission_hospital - Date when the reported case was recorded to have been hospitalized. Specific dates are reported as DD.MM.YYYY. Ranges are recorded as DD.MM.YYYY - DD.MM.YYYY. Ranges with uncertain start or finish dates are recorded as - DD.MM.YYYY and DD.MM.YYYY - respectively.

date_confirmation - Date when the reported case was confirmed as having COVID-19 using rt-PCR. Confirmation accuracy is contingent on the data source used. Specific dates are reported as DD.MM.YYYY. Ranges are recorded as DD.MM.YYYY - DD.MM.YYYY. Ranges with uncertain start or finish dates are recorded as - DD.MM.YYYY and DD.MM.YYYY - respectively.

symptoms - List of symptoms recorded in the description of the case.

lives_in_Wuhan - Recorded relationship of patient with city of Wuhan, Hubei, China. “yes” indicates that the case was a resident of Wuhan. “no” indicates that the case is not a resident of Wuhan (residential). No information indicates that no data was available.

travel_history_dates - Recorded travel dates to and from Wuhan. Specific dates are reported as DD.MM.YYYY and indicate the date when the individual left Wuhan. Ranges are recorded as DD.MM.YYYY - DD.MM.YYYY when both dates are available. Ranges with uncertain start or finish dates are recorded as - DD.MM.YYYY and DD.MM.YYYY - respectively.

travel_history_location - An open field describing the recent recorded travel history of the case.

reported_market_exposure - An open field indicating “yes” if there was reported market exposure and “no” if there was not. N/A indicates that no information is provided.

additional_information - Any additional information that may be informative about the case, such as the occupation of the patient, the purpose of their travels, the hospital they were admitted to, etc.

chronic_disease_binary - 0 represents a case that was reported to have no chronic disease and 1 represents a case that was reported to have a chronic disease.

chronic_disease - Reported chronic condition(s) of the reported case.

source - URL identifying the source of this information

sequence_available - If a genomic sequence was available, the accession number is inserted here.

outcome - Patient's outcome, as either “died” or “discharged” from hospital.

date_death_or_discharge - Reported date of death or discharge in DD.MM.YYYY format.

location – Location of the reported case.

admin3 – Administrative unit level 3 (e.g., zip code) of where the case was reported.

admin2 – Administrative unit level 2 (e.g., county) of where the case was reported.

admin1 – Administrative unit level 1 (e.g., province) of where the case was reported.

country_new – Administrative unit level 0 (e.g., country) of where the case was reported.

admin_id – Administrative unit ID of the lowest level available for the case reported.
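The encodings described in the data dictionary above (N/A for missing values, age ranges such as 50–59, and DD.MM.YYYY dates with optional open-ended ranges) can be normalized before analysis. The following Python sketch is illustrative only: the function names are hypothetical and not part of the repository, and it assumes that field values follow exactly the formats stated above. Anything else raises a ValueError.

```python
from datetime import date, datetime
from typing import Optional, Tuple

DATE_FORMAT = "%d.%m.%Y"  # DD.MM.YYYY, as stated in the data dictionary


def parse_age(value: str) -> Optional[Tuple[float, float]]:
    """Return (min_age, max_age) in years, or None when the age is missing."""
    text = value.strip().replace("–", "-")  # accept an en dash or a hyphen
    if text in ("", "N/A"):
        return None
    if "-" in text:
        start, end = text.split("-", 1)
        return float(start), float(end)
    age = float(text)
    return age, age


def _parse_day(text: str) -> Optional[date]:
    """Parse one DD.MM.YYYY date; an empty string means an unknown bound."""
    text = text.strip()
    return datetime.strptime(text, DATE_FORMAT).date() if text else None


def parse_date_field(value: str) -> Optional[Tuple[Optional[date], Optional[date]]]:
    """Return (start, end); None on either side marks an uncertain bound."""
    text = value.strip()
    if text in ("", "N/A"):
        return None
    if "-" in text:
        # Dates use dots, so a hyphen can only be the range separator.
        start, end = text.split("-", 1)
        return _parse_day(start), _parse_day(end)
    day = _parse_day(text)
    return day, day
```

Representing a single date as a range whose start and end coincide lets all date fields share one type, so downstream code does not need to branch on whether a value was a date or a range.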


At the time of publication the database contained 18,529 geopositioned records from 1st December 2019 to 5th February 2020 (Fig. 1). A map of all records can be viewed in real time here: https://www.healthmap.org/ncov2019/.

Reference shapefiles are available via ESRI (https://esri.maps.arcgis.com/home/item.html?id=c9c26d32bdec4beea7589e303bb06a85 for China admin1, https://esri.maps.arcgis.com/home/item.html?id=0a57592fd41344649f59738e5c330fd3 for China admin2, https://ihme.maps.arcgis.com/home/item.html?id=f3517e223cd544e5a80e9d142caae2b4 for China admin3, https://esri.maps.arcgis.com/home/item.html?id=c8c9696ee6454fb297e36b7dac91481c for Hong Kong, and https://esri.maps.arcgis.com/home/item.html?id=6f76647cf3804e24bd205eb21fccdbc4 for Macau) and GADM (https://gadm.org/data.html for the rest of the world). All shapefiles have a unique identifier for each component; admin_id should be used to merge the line list data with the relevant shapefile for a given country and administrative tier. The admin_id for points refers to the lowest tier of administrative unit reported in columns admin3, admin2, admin1, and country_new. For administrative units themselves, the relevant administrative layer to use is denoted by the geo_resolution column.
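The merge rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes each row is a dictionary keyed by the column names in the data dictionary, that geo_resolution holds either "point" or the name of an administrative layer such as "admin1", and that the shapefile attributes have already been loaded into a hypothetical nested dictionary (layer name, then admin_id). The function names are likewise hypothetical.

```python
from typing import Dict, List, Optional

# Administrative columns, ordered from the lowest tier to the highest.
TIERS = ["admin3", "admin2", "admin1", "country_new"]


def shapefile_layer(row: Dict[str, str]) -> Optional[str]:
    """Pick the administrative layer that the row's admin_id refers to."""
    resolution = row.get("geo_resolution", "").strip()
    if resolution == "point":
        # Points: admin_id refers to the lowest tier that is filled in.
        for tier in TIERS:
            if row.get(tier, "").strip():
                return tier
        return None
    # Administrative units: geo_resolution itself names the layer (assumed).
    return resolution or None


def merge_cases_with_shapes(
    cases: List[Dict[str, str]],
    shapes_by_layer: Dict[str, Dict[str, dict]],
) -> List[dict]:
    """Attach the matching shapefile record (or None) to every case."""
    merged = []
    for row in cases:
        layer = shapefile_layer(row)
        shape = shapes_by_layer.get(layer, {}).get(row.get("admin_id"))
        merged.append({**row, "layer": layer, "shape": shape})
    return merged
```

Cases with no matching shapefile record keep a shape of None rather than being dropped, so unmatched identifiers remain visible for checking.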

Technical Validation

After initial data entry, the database was checked for possible duplicate records using two complementary methods: an automated algorithm and a manual review by the data curators. The algorithm proceeds in five steps: (1) columns with no variability across all records are removed; (2) the remaining data are hashed using a 32-bit variant of MurmurHash3 implemented in the R package FeatureHashing version 0.9.1.3 (refs 8, 9); (3) a principal component analysis is performed on the centered, scaled hashed feature matrix for dimension reduction, retaining principal components with standard deviations greater than 0.5; (4) pairwise Euclidean distances are calculated and normalized based on the smallest observed distance between records; and (5) records with pairwise distances below the 0.5th percentile are flagged as duplicates. Duplicates are defined as records that refer to the same case. Code for these methods is hosted on our GitHub repository (https://github.com/beoutbreakprepared/nCoV2019/covid19). Records identified as possible duplicates were communicated to data curators via GitHub and flagged in the database. Curators then discussed the flagged records via an online chat system (www.slack.com) to reach a consensus on how to address the possible duplications.
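To make the flow of the automated check concrete, here is a simplified, standard-library Python sketch of steps 1, 2, 4, and 5. It is not the authors' implementation, which is written in R and hosted in their repository. It makes several labeled substitutions and assumptions: CRC32 stands in for MurmurHash3, the PCA step (3) is omitted, the ID column is ignored when comparing records, distances are normalized by the smallest non-zero distance, and the comparison uses "at or below" the percentile so that exact duplicates (distance 0) are always flagged. All function names and the bucket count are hypothetical.

```python
import itertools
import math
import zlib
from typing import Dict, List, Sequence, Tuple

N_BUCKETS = 1024  # hypothetical length of the hashed feature vector


def drop_constant_columns(rows: List[Dict[str, str]],
                          ignore: Sequence[str] = ("ID",)) -> List[Dict[str, str]]:
    """Step 1: remove columns with no variability (and the ignored ID column)."""
    keep = [c for c in rows[0]
            if c not in ignore and len({r[c] for r in rows}) > 1]
    return [{c: r[c] for c in keep} for r in rows]


def hash_features(row: Dict[str, str]) -> List[float]:
    """Step 2 (stand-in): hash each column=value pair into a fixed-size vector."""
    vec = [0.0] * N_BUCKETS
    for col, val in row.items():
        bucket = zlib.crc32(f"{col}={val}".encode("utf-8")) % N_BUCKETS
        vec[bucket] += 1.0
    return vec


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def flag_duplicates(rows: List[Dict[str, str]],
                    pct: float = 0.5) -> List[Tuple[int, int]]:
    """Steps 4-5: return index pairs whose distance is at or below the percentile."""
    vectors = [hash_features(r) for r in drop_constant_columns(rows)]
    pairs = list(itertools.combinations(range(len(rows)), 2))
    if not pairs:
        return []
    dists = [math.dist(vectors[i], vectors[j]) for i, j in pairs]
    positive = [d for d in dists if d > 0]
    scale = min(positive) if positive else 1.0  # smallest non-zero distance
    dists = [d / scale for d in dists]
    cutoff = percentile(dists, pct)
    return [pair for pair, d in zip(pairs, dists) if d <= cutoff]
```

Because the threshold is a percentile of the observed distances, at least one pair is always returned. As in the workflow described above, the flagged pairs are candidates for curator review rather than confirmed duplicates.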






