cmrc2018
收藏魔搭社区2026-04-28 更新2025-08-30 收录
下载链接:
https://modelscope.cn/datasets/hfl/cmrc2018
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for "cmrc2018"
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/ymcui/cmrc2018](https://github.com/ymcui/cmrc2018)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 11.50 MB
- **Size of the generated dataset:** 22.31 MB
- **Total amount of disk used:** 33.83 MB
### Dataset Summary
A Span-Extraction dataset for Chinese machine reading comprehension to add language
diversities in this area. The dataset is composed by near 20,000 real questions annotated
on Wikipedia paragraphs by human experts. We also annotated a challenge set which
contains the questions that need comprehensive understanding and multi-sentence
inference throughout the context.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 11.50 MB
- **Size of the generated dataset:** 22.31 MB
- **Total amount of disk used:** 33.83 MB
An example of 'validation' looks as follows.
```
This example was too long and was cropped:
{
"answers": {
"answer_start": [11, 11],
"text": ["光荣和ω-force", "光荣和ω-force"]
},
"context": "\"《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武...",
"id": "DEV_0_QUERY_0",
"question": "《战国无双3》是由哪两个公司合作开发的?"
}
```
### Data Fields
The data fields are the same among all splits.
#### default
- `id`: a `string` feature.
- `context`: a `string` feature.
- `question`: a `string` feature.
- `answers`: a dictionary feature containing:
- `text`: a `string` feature.
- `answer_start`: a `int32` feature.
### Data Splits
| name | train | validation | test |
| ------- | ----: | ---------: | ---: |
| default | 10142 | 3219 | 1002 |
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@inproceedings{cui-emnlp2019-cmrc2018,
title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
author = "Cui, Yiming and
Liu, Ting and
Che, Wanxiang and
Xiao, Li and
Chen, Zhipeng and
Ma, Wentao and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1600",
doi = "10.18653/v1/D19-1600",
pages = "5886--5891",
}
```
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
# 数据集卡片:cmrc2018
## 目录
- [数据集说明](#dataset-description)
- [数据集概览](#dataset-summary)
- [支持任务与排行榜](#supported-tasks-and-leaderboards)
- [语言](#languages)
- [数据集结构](#dataset-structure)
- [数据实例](#data-instances)
- [数据字段](#data-fields)
- [数据划分](#data-splits)
- [数据集构建](#dataset-creation)
- [构建初衷](#curation-rationale)
- [源数据](#source-data)
- [标注信息](#annotations)
- [个人与敏感信息](#personal-and-sensitive-information)
- [数据集使用注意事项](#considerations-for-using-the-data)
- [数据集的社会影响](#social-impact-of-dataset)
- [偏差说明](#discussion-of-biases)
- [其他已知局限](#other-known-limitations)
- [附加信息](#additional-information)
- [数据集维护者](#dataset-curators)
- [许可信息](#licensing-information)
- [引用信息](#citation-information)
- [贡献致谢](#contributions)
## 数据集说明
- **主页:** [https://github.com/ymcui/cmrc2018](https://github.com/ymcui/cmrc2018)
- **代码仓库:** [需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **相关论文:** [需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **联系方式:** [需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **下载数据集文件大小:** 11.50 MB
- **生成数据集大小:** 22.31 MB
- **总磁盘占用:** 33.83 MB
### 数据集概览
本数据集为面向中文机器阅读理解的片段抽取(Span-Extraction)数据集,旨在丰富该领域的语言多样性。数据集包含近2万个由人类专家在维基百科段落上标注的真实问题,同时还构建了一个挑战集,其中的问题需要对上下文进行综合理解与多句推理方能作答。
### 支持任务与排行榜
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 语言
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## 数据集结构
### 数据实例
#### 默认配置
- **下载数据集文件大小:** 11.50 MB
- **生成数据集大小:** 22.31 MB
- **总磁盘占用:** 33.83 MB
以下是`validation`(验证集)的一个示例:
该示例过长已被截断:
{
"answers": {
"answer_start": [11, 11],
"text": ["光荣和ω-force", "光荣和ω-force"]
},
"context": ""《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武...",
"id": "DEV_0_QUERY_0",
"question": "《战国无双3》是由哪两个公司合作开发的?"
}
### 数据字段
所有数据划分的字段格式保持一致。
#### 默认配置
- `id`:字符串(string)类型特征
- `context`:字符串(string)类型特征
- `question`:字符串(string)类型特征
- `answers`:字典类型特征,包含以下子字段:
- `text`:字符串(string)类型特征
- `answer_start`:int32类型特征
### 数据划分
| 划分名称 | 训练集 | 验证集 | 测试集 |
| ------- | ----: | ---------: | ---: |
| default | 10142 | 3219 | 1002 |
## 数据集构建
### 构建初衷
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 源数据
#### 初始数据收集与标准化
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### 源语言内容创作者是谁?
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 标注信息
#### 标注流程
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### 标注人员是谁?
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 个人与敏感信息
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## 数据集使用注意事项
### 数据集的社会影响
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 偏差说明
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 其他已知局限
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## 附加信息
### 数据集维护者
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 许可信息
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 引用信息
@inproceedings{cui-emnlp2019-cmrc2018,
title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
author = "Cui, Yiming and
Liu, Ting and
Che, Wanxiang and
Xiao, Li and
Chen, Zhipeng and
Ma, Wentao and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1600",
doi = "10.18653/v1/D19-1600",
pages = "5886--5891",
}
### 贡献致谢
感谢[@patrickvonplaten](https://github.com/patrickvonplaten)、[@mariamabarham](https://github.com/mariamabarham)、[@lewtun](https://github.com/lewtun)、[@thomwolf](https://github.com/thomwolf)为本数据集的收录提供的贡献。
提供机构:
maas
创建时间:
2025-08-29
搜集汇总
数据集介绍

背景与挑战
背景概述
cmrc2018是一个中文机器阅读理解数据集,包含约20,000个由人类专家标注的真实问题,支持跨度提取任务,旨在增加该领域的语言多样性。数据集分为训练集、验证集和测试集,分别包含10142、3219和1002个样本。
以上内容由遇见数据集搜集并总结生成



