cord19
Source: ModelScope community (updated 2025-07-04, indexed 2025-05-31)
Download link:
https://modelscope.cn/datasets/allenai/cord19
# Dataset Card for CORD-19
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://www.semanticscholar.org/cord19](https://www.semanticscholar.org/cord19)
- **Repository:** [https://github.com/allenai/cord19](https://github.com/allenai/cord19)
- **Paper:** [CORD-19: The COVID-19 Open Research Dataset](https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1/)
- **Leaderboard:** [Kaggle challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
### Dataset Summary
CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.
### Supported Tasks and Leaderboards
See the tasks defined in the related [Kaggle challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks).
### Languages
The dataset is in English (en).
## Dataset Structure
### Data Instances
The following code block presents an overview of a sample in JSON-like syntax (abbreviated, since some fields are very long):
```
{
"abstract": "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified [...]",
"authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
"cord_uid": "ug7v899j",
"doc_embeddings": [
-2.939983606338501,
-6.312200546264648,
-1.0459030866622925,
[...] 766 values in total [...]
-4.107113361358643,
-3.8174145221710205,
1.8976187705993652,
5.811529159545898,
-2.9323840141296387
],
"doi": "10.1186/1471-2334-1-6",
"journal": "BMC Infect Dis",
"publish_time": "2001-07-04",
"sha": "d1aafb70c066a2068b02786f8929fd9c900897fb",
"source_x": "PMC",
"title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",
"url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35282/"
}
```
### Data Fields
Currently only the following fields are integrated: `cord_uid`, `sha`, `source_x`, `title`, `doi`, `abstract`, `publish_time`, `authors`, `journal`. With the `fulltext` configuration, the sections transcribed in `pdf_json_files` are converted into the `fulltext` feature.
- `cord_uid`: A `str`-valued field that assigns a unique identifier to each CORD-19 paper. This is not necessarily unique per row, which is explained in the FAQs.
- `sha`: A `List[str]`-valued field that is the SHA1 of all PDFs associated with the CORD-19 paper. Most papers will have either zero or one value here (since we either have a PDF or we don't), but some papers will have multiple. For example, the main paper might have supplemental information saved in a separate PDF. Or we might have two separate PDF copies of the same paper. If multiple PDFs exist, their SHA1 will be semicolon-separated (e.g. `'4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; d4f0247db5e916c20eae3f6d772e8572eb828236'`)
- `source_x`: A `List[str]`-valued field that is the names of sources from which we received this paper. Also semicolon-separated. For example, `'ArXiv; Elsevier; PMC; WHO'`. There should always be at least one source listed.
- `title`: A `str`-valued field for the paper title
- `doi`: A `str`-valued field for the paper DOI
- `pmcid`: A `str`-valued field for the paper's ID on PubMed Central. Should begin with `PMC` followed by an integer.
- `pubmed_id`: An `int`-valued field for the paper's ID on PubMed.
- `license`: A `str`-valued field with the most permissive license we've found associated with this paper. Possible values include: `'cc0', 'hybrid-oa', 'els-covid', 'no-cc', 'cc-by-nc-sa', 'cc-by', 'gold-oa', 'biorxiv', 'green-oa', 'bronze-oa', 'cc-by-nc', 'medrxiv', 'cc-by-nd', 'arxiv', 'unk', 'cc-by-sa', 'cc-by-nc-nd'`
- `abstract`: A `str`-valued field for the paper's abstract
- `publish_time`: A `str`-valued field for the published date of the paper. This is in `yyyy-mm-dd` format. Not always accurate as some publishers will denote unknown dates with future dates like `yyyy-12-31`
- `authors`: A `List[str]`-valued field for the authors of the paper. Each author name is in `Last, First Middle` format and semicolon-separated.
- `journal`: A `str`-valued field for the paper journal. Strings are not normalized (e.g. `BMJ` and `British Medical Journal` can both exist). Empty string if unknown.
- `mag_id`: Deprecated, but originally an `int`-valued field for the paper as represented in the Microsoft Academic Graph.
- `who_covidence_id`: A `str`-valued field for the ID assigned by the WHO for this paper. Format looks like `#72306`.
- `arxiv_id`: A `str`-valued field for the arXiv ID of this paper.
- `pdf_json_files`: A `List[str]`-valued field containing paths from the root of the current data dump version to the parses of the paper PDFs into JSON format. Multiple paths are semicolon-separated. Example: `document_parses/pdf_json/4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a.json; document_parses/pdf_json/d4f0247db5e916c20eae3f6d772e8572eb828236.json`
- `pmc_json_files`: A `List[str]`-valued field. Same as above, but corresponding to the full text XML files downloaded from PMC, parsed into the same JSON format as above.
- `url`: A `List[str]`-valued field containing all URLs associated with this paper. Semicolon-separated.
- `s2_id`: A `str`-valued field containing the Semantic Scholar ID for this paper. Can be used with the Semantic Scholar API (e.g. `s2_id=9445722` corresponds to `http://api.semanticscholar.org/corpusid:9445722`)
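Several of the fields above pack multiple values into one semicolon-separated string, and author names follow the `Last, First Middle` pattern. The sketch below shows one way to turn such strings into Python lists. It assumes each record is a plain dict of raw strings, as in the sample instance; the helper names (`split_multi`, `parse_author`, `normalize_record`) and the `MULTI_VALUED_FIELDS` tuple are hypothetical and not part of the dataset or any library.

```python
from typing import Dict, List

# Fields the card describes as semicolon-separated. Treating exactly this
# set as multi-valued is an assumption of this sketch.
MULTI_VALUED_FIELDS = (
    "sha", "source_x", "authors", "url", "pdf_json_files", "pmc_json_files",
)


def split_multi(value: str) -> List[str]:
    """Split a semicolon-separated string, trimming whitespace and blanks."""
    if not value:
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def parse_author(name: str) -> Dict[str, str]:
    """Split a 'Last, First Middle' name into last and given-name parts."""
    last, _, given = name.partition(",")
    return {"last": last.strip(), "given": given.strip()}


def normalize_record(record: Dict[str, str]) -> Dict[str, object]:
    """Return a copy of the record with multi-valued fields as lists."""
    out: Dict[str, object] = dict(record)
    for field in MULTI_VALUED_FIELDS:
        if field in record:
            out[field] = split_multi(record[field])
    return out


# Values taken from the examples in this card.
sample = {
    "cord_uid": "ug7v899j",
    "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
    "sha": "4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; "
           "d4f0247db5e916c20eae3f6d772e8572eb828236",
    "source_x": "PMC",
}

normalized = normalize_record(sample)
authors = [parse_author(name) for name in normalized["authors"]]
print(normalized["sha"])
print(authors)
```

Fields that are absent from a record are left untouched, so the same helper can be applied regardless of which configuration was loaded.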
Extra fields based on selected configuration during loading:
- `fulltext`: A `str`-valued field containing the concatenation of all text sections from the JSON parse (itself extracted from the PDF).
- `doc_embeddings`: A `sequence` of float-valued elements containing the document embedding as a vector of floats (parsed from a string of values separated by ','). Details on the system used to extract the embeddings are available in: [SPECTER: Document-level Representation Learning using Citation-informed Transformers](https://arxiv.org/abs/2004.07180). In short, it relies on a BERT model pre-trained on document-level relatedness using the citation graph. The system can be queried through REST (see [public API documentation](https://github.com/allenai/paper-embedding-public-apis)).
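As an illustration of working with `doc_embeddings`, the sketch below parses a comma-separated string into floats and compares two vectors. Cosine similarity is used here as a common choice for comparing embedding vectors; the card does not prescribe a metric, so treat that choice as an assumption. The function names and the short toy vectors are hypothetical and stand in for real embeddings.

```python
import math
from typing import List, Sequence


def parse_embedding(raw: str) -> List[float]:
    """Parse a comma-separated string of numbers into a list of floats."""
    return [float(token) for token in raw.split(",") if token.strip()]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, not real document embeddings.
vec_a = parse_embedding("1.0, 0.0, 2.0")
vec_b = parse_embedding("2.0, 0.0, 4.0")
vec_c = parse_embedding("0.0, 1.0, 0.0")

print(cosine_similarity(vec_a, vec_b))  # parallel vectors, close to 1
print(cosine_similarity(vec_a, vec_c))  # orthogonal vectors, 0
```

When the `embeddings` configuration already yields a list of floats, only `cosine_similarity` is needed; `parse_embedding` covers the raw string form.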
### Data Splits
This dataset provides no annotations, so all instances are placed in the training split.
The sizes of each configuration are:
| | train |
|------------|-------:|
| metadata | 368618 |
| fulltext | 368618 |
| embeddings | 368618 |
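Because `cord_uid` is not necessarily unique per row, the row counts above may be larger than the number of distinct papers. The sketch below groups rows by `cord_uid` to make that difference visible. It assumes rows are available as an iterable of dicts; `group_by_cord_uid` is a hypothetical helper, and the toy rows (including the second identifier) are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_cord_uid(
    rows: Iterable[Mapping[str, str]],
) -> Dict[str, List[Mapping[str, str]]]:
    """Collect all rows that share the same cord_uid."""
    groups: Dict[str, List[Mapping[str, str]]] = defaultdict(list)
    for row in rows:
        groups[row["cord_uid"]].append(row)
    return dict(groups)


# Toy rows: the duplicate entry and "hypo0001" are hypothetical.
rows = [
    {"cord_uid": "ug7v899j", "source_x": "PMC"},
    {"cord_uid": "ug7v899j", "source_x": "Elsevier"},
    {"cord_uid": "hypo0001", "source_x": "WHO"},
]

groups = group_by_cord_uid(rows)
print(len(rows), "rows,", len(groups), "distinct cord_uid values")
```

How to merge rows that share a `cord_uid` is left open here; the grouping only shows where such rows occur.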
## Dataset Creation
### Curation Rationale
See [official readme](https://github.com/allenai/cord19/blob/master/README.md)
### Source Data
See [official readme](https://github.com/allenai/cord19/blob/master/README.md)
#### Initial Data Collection and Normalization
See [official readme](https://github.com/allenai/cord19/blob/master/README.md)
#### Who are the source language producers?
See [official readme](https://github.com/allenai/cord19/blob/master/README.md)
### Annotations
No annotations
#### Annotation process
N/A
#### Who are the annotators?
N/A
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@article{Wang2020CORD19TC,
title={CORD-19: The Covid-19 Open Research Dataset},
author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and
K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and
Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and
D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier},
journal={ArXiv},
year={2020}
}
```
### Contributions
Thanks to [@ggdupont](https://github.com/ggdupont) for adding this dataset.
Provider: maas
Created: 2025-05-29



