Corona-virus disease (COVID-19) Data-set with Improved Measurement Errors of Referenced Official Data Sources
Source: Mendeley Data | Updated: 2020-05-04 | Indexed: 2026-04-09
Download link:
https://data.mendeley.com/datasets/nw5m4hs3jr/2
Description:
This dataset is the result of a study on the quality of the official datasets available for COVID-19. We used comparative statistical analysis to evaluate the accuracy of data collection by one national organisation (Chinese Center for Disease Control and Prevention) and two international organisations (World Health Organization; European Centre for Disease Prevention and Control), based on the value of their systematic measurement errors. The data were collected using text mining techniques and by reviewing reports, metadata, and reference data. The combined dataset includes complete spatial data such as country area, standard country codes (M49 code), Alpha-2 codes, Alpha-3 codes, latitude, and longitude, plus some additional attributes such as population. The data for China are presented in more detail in a separate sheet, extracted from the reports attached to the main page of the CCDC website. The dataset also benefits from major corrections to the referenced datasets and official reports: adjusting report dates (which lagged by one or two days), removing four negative values, detecting unreasonable changes to historical data in newer reports (revealed by comparing the daily reports), and correcting systematic measurement errors (which grew as the number of affected countries increased). An aggregated root mean square error was used to identify the main problematic parts of the datasets, alongside the comparative statistical analysis used to evaluate the errors. The result is a combined dataset with reduced systematic measurement errors and with some new attributes, such as daily mortality and fatality rates, in addition to the usual attributes of SARS-CoV-2 and coronavirus disease. This dataset could be considered a comprehensive and reliable source of COVID-19 data for further studies.
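The description mentions an aggregated root mean square error between sources and the removal of negative values, but does not give formulas. The following is a minimal Python sketch of one plausible reading, not the authors' actual procedure: it assumes each source is a mapping from a country code to a list of daily counts aligned by date, and it pools squared differences across all countries shared by two sources. The function names (`rmse`, `aggregated_rmse`, `negative_indices`) and the sample values are hypothetical and do not come from the dataset.

```python
import math


def rmse(reference, candidate):
    """Root mean square error between two equal-length daily series."""
    if len(reference) != len(candidate):
        raise ValueError("series must have the same length")
    if not reference:
        raise ValueError("series must not be empty")
    total = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return math.sqrt(total / len(reference))


def aggregated_rmse(source_a, source_b):
    """Pool squared differences over every country present in both sources.

    Assumption: 'aggregated' means one RMSE over all country-day pairs.
    """
    total, count = 0.0, 0
    for country in sorted(source_a.keys() & source_b.keys()):
        a, b = source_a[country], source_b[country]
        if len(a) != len(b):
            raise ValueError(f"series for {country} differ in length")
        total += sum((x - y) ** 2 for x, y in zip(a, b))
        count += len(a)
    if count == 0:
        raise ValueError("no overlapping country-day pairs")
    return math.sqrt(total / count)


def negative_indices(series):
    """Positions of negative daily counts, which are candidates for correction."""
    return [i for i, value in enumerate(series) if value < 0]


# Hypothetical sample data: placeholder country codes, made-up counts.
source_a = {"AAA": [1, 3, 6], "BBB": [0, 2, 2]}
source_b = {"AAA": [1, 4, 6], "BBB": [0, 2, 4]}

print(rmse(source_a["AAA"], source_b["AAA"]))
print(aggregated_rmse(source_a, source_b))
print(negative_indices([3, -1, 0, -4]))
```

Computing `rmse` per country alongside the pooled value is one way to see which countries contribute most to the disagreement between two sources.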
Created:
2020-05-04



