allenai/cord19
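Dataset Card for CORD-19
Dataset Description
Dataset Summary
CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It is maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.
Supported Tasks and Leaderboards
See the tasks defined in the related Kaggle challenge.
Languages
The dataset is in English (en).
Dataset Structure
Data Instances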
The following block gives an overview of one sample in JSON format (some fields are very long and have been abbreviated):

    {
      "abstract": "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified [...]",
      "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
      "cord_uid": "ug7v899j",
      "doc_embeddings": [
        -2.939983606338501,
        -6.312200546264648,
        -1.0459030866622925,
        [...] 766 values in total [...]
        -4.107113361358643,
        -3.8174145221710205,
        1.8976187705993652,
        5.811529159545898,
        -2.9323840141296387
      ],
      "doi": "10.1186/1471-2334-1-6",
      "journal": "BMC Infect Dis",
      "publish_time": "2001-07-04",
      "sha": "d1aafb70c066a2068b02786f8929fd9c900897fb",
      "source_x": "PMC",
      "title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",
      "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35282/"
    }
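Data Fields
Currently only the following fields are integrated: cord_uid, sha, source_x, title, doi, abstract, publish_time, authors, journal. In the fulltext configuration, the sections in pdf_json_files are converted into a fulltext feature.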
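- cord_uid: A unique identifier assigned to each CORD-19 paper, of type str.
- sha: The SHA1 of all PDFs associated with the CORD-19 paper, of type List[str].
- source_x: The names of the sources from which the paper was received, of type List[str].
- title: The paper title, of type str.
- doi: The paper DOI, of type str.
- abstract: The paper abstract, of type str.
- publish_time: The publication date of the paper, of type str, in yyyy-mm-dd format.
- authors: The paper authors, of type List[str].
- journal: The journal of the paper, of type str.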
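
As a minimal sketch of working with these fields, the snippet below converts publish_time into a date object and authors into a list of names. The helper names (parse_publish_time, split_authors, normalize_record) are hypothetical and not part of the dataset. Splitting authors on semicolons is an assumption based on the sample record, where authors appears as a single semicolon-separated string even though the field is documented as List[str]; the helper accepts either form.

```python
from datetime import date, datetime
from typing import Any, Dict, List, Optional


def parse_publish_time(value: Any) -> Optional[date]:
    """Parse a yyyy-mm-dd string; return None if the value is missing or malformed."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def split_authors(value: Any) -> List[str]:
    """Return authors as a list, accepting a list or a semicolon-separated string.

    Assumption: semicolons separate authors, as in the sample record.
    """
    if isinstance(value, str):
        return [name.strip() for name in value.split(";") if name.strip()]
    return list(value or [])


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy a record, converting publish_time and authors to structured values."""
    normalized = dict(record)
    normalized["publish_time"] = parse_publish_time(record.get("publish_time"))
    normalized["authors"] = split_authors(record.get("authors"))
    return normalized


# A trimmed copy of the sample record, keeping only the fields used here.
sample = {
    "cord_uid": "ug7v899j",
    "publish_time": "2001-07-04",
    "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
}
clean = normalize_record(sample)
print(clean["publish_time"].year, clean["authors"])
```

Returning None for an unparseable date, rather than raising, is a design choice for bulk processing; stricter pipelines may prefer to fail loudly.
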
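Additional fields depend on the configuration selected at load time:
- fulltext: The concatenation of all text sections from the JSON (itself extracted from the PDF), of type str.
- doc_embeddings: The document embedding, of type sequence: a vector of float values.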
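
One common use of document embedding vectors is comparing papers by cosine similarity. The sketch below is an illustration under that assumption, not a method prescribed by the dataset; the function name cosine_similarity and the toy three-dimensional vectors are hypothetical stand-ins for real doc_embeddings values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the doc_embeddings of three papers.
paper_a = [1.0, 2.0, 3.0]
paper_b = [2.0, 4.0, 6.0]   # same direction as paper_a
paper_c = [-3.0, 0.0, 1.0]  # orthogonal to paper_a

sim_ab = cosine_similarity(paper_a, paper_b)
sim_ac = cosine_similarity(paper_a, paper_c)
print(sim_ab, sim_ac)
```

For a corpus of this size, a vectorized library would normally replace the pure-Python loops; the version here only aims to make the computation explicit.
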
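Data Splits
Because this dataset provides no annotations, all instances are placed in the train split.
The size of each configuration is as follows: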
| | train |
|---|---|
| metadata | 368618 |
| fulltext | 368618 |
| embeddings | 368618 |
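
The three configurations in the table can be selected by name when loading. The sketch below assumes the Hugging Face datasets library and the repository id allenai/cord19; the helpers expected_fields and load_cord19 are hypothetical names. The mapping from configuration to extra field is an assumption inferred from the configuration names, and depending on the installed version of datasets, additional arguments may be required.

```python
from typing import Dict, List

# Fields listed in this card as integrated for every configuration.
BASE_FIELDS: List[str] = [
    "cord_uid", "sha", "source_x", "title", "doi",
    "abstract", "publish_time", "authors", "journal",
]

# Assumption: each configuration adds the field that matches its name.
CONFIG_EXTRA_FIELDS: Dict[str, List[str]] = {
    "metadata": [],
    "fulltext": ["fulltext"],
    "embeddings": ["doc_embeddings"],
}


def expected_fields(config: str) -> List[str]:
    """Return the field names expected for a configuration."""
    if config not in CONFIG_EXTRA_FIELDS:
        raise ValueError(f"unknown configuration: {config!r}")
    return BASE_FIELDS + CONFIG_EXTRA_FIELDS[config]


def load_cord19(config: str = "metadata"):
    """Load the train split of one configuration (requires network access)."""
    expected_fields(config)  # validate the name before importing anything
    from datasets import load_dataset  # third-party: pip install datasets
    return load_dataset("allenai/cord19", config, split="train")


for name in CONFIG_EXTRA_FIELDS:
    print(name, expected_fields(name))
```

Calling load_cord19("embeddings") would then return the train split, since all instances live there.
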
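Dataset Creation
Curation Rationale
See the official README.
Source Data
See the official README.
Annotations
No annotations.
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
[More Information Needed]
Citation Information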
@article{Wang2020CORD19TC,
  title={CORD-19: The Covid-19 Open Research Dataset},
  author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier},
  journal={ArXiv},
  year={2020}
}
Contributions
Thanks to @ggdupont for adding this dataset.



