allenai/cosmos_qa
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/allenai/cosmos_qa
Resource description:
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: CosmosQA
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- multiple-choice
task_ids:
- multiple-choice-qa
paperswithcode_id: cosmosqa
dataset_info:
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer0
    dtype: string
  - name: answer1
    dtype: string
  - name: answer2
    dtype: string
  - name: answer3
    dtype: string
  - name: label
    dtype: int32
  splits:
  - name: train
    num_bytes: 17159918
    num_examples: 25262
  - name: test
    num_bytes: 5121479
    num_examples: 6963
  - name: validation
    num_bytes: 2186987
    num_examples: 2985
  download_size: 24399475
  dataset_size: 24468384
---
# Dataset Card for "cosmos_qa"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://wilburone.github.io/cosmos/](https://wilburone.github.io/cosmos/)
- **Repository:** https://github.com/wilburOne/cosmosqa/
- **Paper:** [Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning](https://arxiv.org/abs/1909.00277)
- **Point of Contact:** [Lifu Huang](mailto:warrior.fu@gmail.com)
- **Size of downloaded dataset files:** 24.40 MB
- **Size of the generated dataset:** 24.51 MB
- **Total amount of disk used:** 48.91 MB
### Dataset Summary
Cosmos QA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions about the likely causes or effects of events that require reasoning beyond the exact text spans in the context.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
The text in the dataset is in English (`en`); the dataset is monolingual.
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 24.40 MB
- **Size of the generated dataset:** 24.51 MB
- **Total amount of disk used:** 48.91 MB
An example of 'validation' looks as follows.
```
This example was too long and was cropped:
{
"answer0": "If he gets married in the church he wo nt have to get a divorce .",
"answer1": "He wants to get married to a different person .",
"answer2": "He wants to know if he does nt like this girl can he divorce her ?",
"answer3": "None of the above choices .",
"context": "\"Do i need to go for a legal divorce ? I wanted to marry a woman but she is not in the same religion , so i am not concern of th...",
"id": "3BFF0DJK8XA7YNK4QYIGCOG1A95STE##3180JW2OT5AF02OISBX66RFOCTG5J7##A2LTOS0AZ3B28A##Blog_56156##q1_a1##378G7J1SJNCDAAIN46FM2P7T6KZEW2",
"label": 1,
"question": "Why is this person asking about divorce ?"
}
```
### Data Fields
The data fields are the same among all splits.
#### default
- `id`: a `string` feature.
- `context`: a `string` feature.
- `question`: a `string` feature.
- `answer0`: a `string` feature.
- `answer1`: a `string` feature.
- `answer2`: a `string` feature.
- `answer3`: a `string` feature.
- `label`: an `int32` feature, the index of the correct answer among `answer0` to `answer3`.
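
The fields above can be combined into a single multiple-choice item. The following is a minimal sketch, not part of the original card: the helper name `to_multiple_choice` and the prompt layout are illustrative assumptions, and the sample record is the cropped validation example shown under Data Instances (its `context` is truncated in the card and is kept truncated here).

```python
from typing import Dict, List, Optional, Tuple

# Column names as listed in the Data Fields section.
ANSWER_KEYS = ["answer0", "answer1", "answer2", "answer3"]


def to_multiple_choice(record: Dict) -> Tuple[str, List[str], Optional[str]]:
    """Return (prompt, choices, gold_answer_text) for one record.

    Hypothetical helper: it assumes `label` indexes answer0..answer3 and
    returns None as the gold text when the label falls outside that range.
    """
    choices = [record[key] for key in ANSWER_KEYS]
    prompt = f"{record['context']}\nQuestion: {record['question']}"
    label = int(record["label"])
    gold = choices[label] if 0 <= label < len(choices) else None
    return prompt, choices, gold


# The cropped validation example from the card (context left truncated).
SAMPLE = {
    "answer0": "If he gets married in the church he wo nt have to get a divorce .",
    "answer1": "He wants to get married to a different person .",
    "answer2": "He wants to know if he does nt like this girl can he divorce her ?",
    "answer3": "None of the above choices .",
    "context": "\"Do i need to go for a legal divorce ? I wanted to marry a woman "
               "but she is not in the same religion , so i am not concern of th...",
    "label": 1,
    "question": "Why is this person asking about divorce ?",
}

if __name__ == "__main__":
    prompt, choices, gold = to_multiple_choice(SAMPLE)
    print(prompt)
    for index, choice in enumerate(choices):
        print(f"  ({index}) {choice}")
    print("Gold answer:", gold)
```

For the sample record, `label` is 1, so the gold answer text is the content of `answer1`.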
### Data Splits
| name |train|validation|test|
|-------|----:|---------:|---:|
|default|25262| 2985|6963|
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
As reported via email by Yejin Choi, the dataset is licensed under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
### Citation Information
```
@inproceedings{huang-etal-2019-cosmos,
    title = "Cosmos {QA}: Machine Reading Comprehension with Contextual Commonsense Reasoning",
    author = "Huang, Lifu and
      Le Bras, Ronan and
      Bhagavatula, Chandra and
      Choi, Yejin",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1243",
    doi = "10.18653/v1/D19-1243",
    pages = "2391--2401",
}
```
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun), [@albertvillanova](https://github.com/albertvillanova), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Provider:
allenai
Summary of Original Information
Dataset Overview
Dataset Name
- Name: CosmosQA
- Alias: cosmos_qa
Basic Dataset Information
- Language: English (en)
- Language creators: found
- License: CC BY 4.0
- Multilinguality: monolingual
- Size category: 10K<n<100K
- Source datasets: original
- Task category: multiple-choice
- Task ID: multiple-choice-qa
Dataset Features
- id: string
- context: string
- question: string
- answer0: string
- answer1: string
- answer2: string
- answer3: string
- label: int32
Dataset Splits
- Train: 25262 examples, 17159918 bytes
- Test: 6963 examples, 5121479 bytes
- Validation: 2985 examples, 2186987 bytes
- Download size: 24399475 bytes
- Dataset size: 24468384 bytes
Dataset Creation
- Annotation creators: crowdsourced
- License information: as reported via email, the dataset is licensed under CC BY 4.0
- Citation information:
@inproceedings{huang-etal-2019-cosmos, title = "Cosmos {QA}: Machine Reading Comprehension with Contextual Commonsense Reasoning", author = "Huang, Lifu and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin", booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)", month = nov, year = "2019", address = "Hong Kong, China", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D19-1243", doi = "10.18653/v1/D19-1243", pages = "2391--2401", }
Collected Summary
Dataset Introduction

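Construction Method
Cosmos QA is built on narratives from people's everyday lives. Answering its multiple-choice questions about the likely causes or effects of events requires reasoning from the context. The dataset was collected through crowdsourcing, with an annotation process designed to ensure the quality of the questions and answers.
Usage
Users can pick the split that suits their needs (train, validation, or test). The dataset is stored in JSON format, and each record contains a question ID, a context, a question, four candidate answers, and the label of the correct answer. The data can be loaded with the Hugging Face `datasets` library to train and evaluate models on machine reading comprehension and commonsense reasoning tasks.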
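
A minimal sketch of that workflow, under stated assumptions: the Hub identifier `allenai/cosmos_qa` is taken from the download link at the top of this page, `load_cosmos_qa` and `accuracy` are hypothetical helper names, and depending on the installed `datasets` version the loading call may need additional arguments that are not verified here.

```python
from typing import Iterable, Mapping, Sequence


def load_cosmos_qa(split: str = "validation"):
    """Load one split ("train", "validation", or "test") from the Hub.

    Needs the `datasets` package and network access. The import is kept
    inside the function so the helper below works without the package.
    """
    from datasets import load_dataset

    return load_dataset("allenai/cosmos_qa", split=split)


def accuracy(predictions: Sequence[int], examples: Iterable[Mapping]) -> float:
    """Fraction of examples whose predicted index equals the `label` field."""
    labels = [int(example["label"]) for example in examples]
    if len(predictions) != len(labels):
        raise ValueError("predictions and examples must have the same length")
    if not labels:
        return 0.0
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)
```

For example, `accuracy([1, 0], [{"label": 1}, {"label": 2}])` evaluates to `0.5`, since only the first prediction matches its label. With a loaded split, the same call would take the model's predicted indices and the split itself as the two arguments.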
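Background and Challenges
Background Overview
Cosmos QA was created in 2019 by Lifu Huang et al. to advance machine reading comprehension, with a particular focus on commonsense reasoning in context. It contains 35.6K reading comprehension problems posed as multiple-choice questions over people's everyday narratives. The questions require reasoning about the likely causes or effects of events rather than only the literal content of the text. The dataset is valuable for research on improving machines' ability to understand complex situations and reason in depth, and it has been influential in natural language processing.
Current Challenges
Challenges in building the dataset include designing questions that accurately reflect commonsense reasoning ability while covering diverse everyday situations, and collecting high-quality data through crowdsourcing while keeping annotations accurate and consistent. Possible biases and limitations in the data, and the need to ensure that its use does not raise social, ethical, or legal problems, are further considerations when using or extending the dataset.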
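Common Scenarios
Classic Use Cases
In natural language processing, and in machine reading comprehension research in particular, CosmosQA serves as a benchmark for testing deeper model understanding because of its commonsense reasoning question design. It provides narratives rich in context and asks the model to infer the likely causes or effects of events, giving researchers a way to demonstrate a model's reasoning ability.
Academic Problems Addressed
CosmosQA addresses the lack of reasoning and commonsense testing in traditional reading comprehension datasets. It requires models not only to understand the literal meaning of the text but also to reason beyond the exact text spans, which matters for advancing machine reading comprehension and improving models' understanding of complex contexts.
Practical Applications
In practice, CosmosQA can be used to train and evaluate dialogue systems, recommender systems, and other AI applications that need to approximate human commonsense reasoning. Training on this kind of reasoning can help models understand and respond to users' implicit needs.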
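Recent Research on the Dataset
Latest Research Directions
As a large-scale set of reading comprehension problems based on commonsense reasoning, Cosmos QA is used in recent research that focuses on improving machines' ability to understand and reason about complex contexts. What sets the dataset apart is that it requires models to infer the likely causes or effects behind events, not only to understand the literal text. Current work explores how deep learning models perform on this kind of task, especially how contextual commonsense reasoning can raise multiple-choice question answering accuracy. Other work examines bias and fairness in the dataset, with the aim of consistent model behavior across groups and situations. Continued research on Cosmos QA contributes to deeper and broader machine understanding of everyday human narratives.
The content above was collected and summarized by 遇见数据集.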



