allenai/cosmos_qa
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link: https://hf-mirror.com/datasets/allenai/cosmos_qa
Resource overview:
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: CosmosQA
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- multiple-choice
task_ids:
- multiple-choice-qa
paperswithcode_id: cosmosqa
dataset_info:
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer0
    dtype: string
  - name: answer1
    dtype: string
  - name: answer2
    dtype: string
  - name: answer3
    dtype: string
  - name: label
    dtype: int32
  splits:
  - name: train
    num_bytes: 17159918
    num_examples: 25262
  - name: test
    num_bytes: 5121479
    num_examples: 6963
  - name: validation
    num_bytes: 2186987
    num_examples: 2985
  download_size: 24399475
  dataset_size: 24468384
---
# Dataset Card for "cosmos_qa"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://wilburone.github.io/cosmos/](https://wilburone.github.io/cosmos/)
- **Repository:** https://github.com/wilburOne/cosmosqa/
- **Paper:** [Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning](https://arxiv.org/abs/1909.00277)
- **Point of Contact:** [Lifu Huang](mailto:warrior.fu@gmail.com)
- **Size of downloaded dataset files:** 24.40 MB
- **Size of the generated dataset:** 24.51 MB
- **Total amount of disk used:** 48.91 MB
### Dataset Summary
Cosmos QA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions about the likely causes or effects of events that require reasoning beyond the exact text spans in the context.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 24.40 MB
- **Size of the generated dataset:** 24.51 MB
- **Total amount of disk used:** 48.91 MB
An example of 'validation' looks as follows.
```
This example was too long and was cropped:
{
"answer0": "If he gets married in the church he wo nt have to get a divorce .",
"answer1": "He wants to get married to a different person .",
"answer2": "He wants to know if he does nt like this girl can he divorce her ?",
"answer3": "None of the above choices .",
"context": "\"Do i need to go for a legal divorce ? I wanted to marry a woman but she is not in the same religion , so i am not concern of th...",
"id": "3BFF0DJK8XA7YNK4QYIGCOG1A95STE##3180JW2OT5AF02OISBX66RFOCTG5J7##A2LTOS0AZ3B28A##Blog_56156##q1_a1##378G7J1SJNCDAAIN46FM2P7T6KZEW2",
"label": 1,
"question": "Why is this person asking about divorce ?"
}
```
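To make the record layout concrete, the following is a minimal Python sketch that turns one record into a multiple-choice prompt and looks up the answer selected by `label`. The helper names (`answer_options`, `gold_answer`, `format_prompt`) and the prompt layout are illustrative assumptions, not part of the dataset or of any library. Treating an out-of-range `label` as "no gold answer" is also an assumption made for robustness.

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[str, int]]
ANSWER_KEYS = ("answer0", "answer1", "answer2", "answer3")

# The validation example shown above, with the cropped context left as-is.
EXAMPLE: Record = {
    "answer0": "If he gets married in the church he wo nt have to get a divorce .",
    "answer1": "He wants to get married to a different person .",
    "answer2": "He wants to know if he does nt like this girl can he divorce her ?",
    "answer3": "None of the above choices .",
    "context": "\"Do i need to go for a legal divorce ? I wanted to marry a woman but she is not in the same religion , so i am not concern of th...",
    "id": "3BFF0DJK8XA7YNK4QYIGCOG1A95STE##3180JW2OT5AF02OISBX66RFOCTG5J7##A2LTOS0AZ3B28A##Blog_56156##q1_a1##378G7J1SJNCDAAIN46FM2P7T6KZEW2",
    "label": 1,
    "question": "Why is this person asking about divorce ?",
}


def answer_options(record: Record) -> List[str]:
    """Return the four candidate answers in index order."""
    return [str(record[key]) for key in ANSWER_KEYS]


def gold_answer(record: Record) -> Optional[str]:
    """Return the answer selected by `label`, or None if the label is out of range.

    Returning None for out-of-range labels is an assumption of this sketch.
    """
    label = int(record["label"])
    options = answer_options(record)
    return options[label] if 0 <= label < len(options) else None


def format_prompt(record: Record) -> str:
    """Render context, question, and numbered options as plain text (hypothetical layout)."""
    lines = [f"Context: {record['context']}", f"Question: {record['question']}"]
    lines += [f"({i}) {option}" for i, option in enumerate(answer_options(record))]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_prompt(EXAMPLE))
    print("Gold answer:", gold_answer(EXAMPLE))
```

For the example above, `label` is 1, so the lookup returns the text stored in `answer1`.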
### Data Fields
The data fields are the same among all splits.
#### default
- `id`: a `string` feature.
- `context`: a `string` feature.
- `question`: a `string` feature.
- `answer0`: a `string` feature.
- `answer1`: a `string` feature.
- `answer2`: a `string` feature.
- `answer3`: a `string` feature.
- `label`: an `int32` feature, the index of the correct answer among `answer0` through `answer3`.
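The field list above can be checked mechanically. The sketch below is one possible validator in plain Python. The function name `schema_problems` and the error-message wording are hypothetical, and mapping `int32` to a range-checked Python `int` is an assumption of this example rather than something the card specifies.

```python
from typing import Any, Dict, List

# Field names and types as listed in the card; `int32` is modeled as a Python int.
EXPECTED_TYPES = {
    "id": str,
    "context": str,
    "question": str,
    "answer0": str,
    "answer1": str,
    "answer2": str,
    "answer3": str,
    "label": int,
}
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if the record conforms)."""
    problems: List[str] = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
        elif expected is int and not INT32_MIN <= value <= INT32_MAX:
            problems.append(f"{name} does not fit in int32")
    for name in record:
        if name not in EXPECTED_TYPES:
            problems.append(f"unexpected field: {name}")
    return problems
```

A conforming record yields an empty list, so `if schema_problems(record):` is enough to flag bad rows before training or evaluation.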
### Data Splits
| name |train|validation|test|
|-------|----:|---------:|---:|
|default|25262| 2985|6963|
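As a quick worked calculation on the table above, the following sketch sums the three split sizes and expresses each as a percentage of the total. The variable and function names are illustrative only.

```python
from typing import Dict

# Example counts copied from the splits table above.
SPLIT_SIZES: Dict[str, int] = {"train": 25262, "validation": 2985, "test": 6963}


def split_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total, in percent, rounded to one decimal."""
    total = sum(sizes.values())
    return {name: round(100 * count / total, 1) for name, count in sizes.items()}


if __name__ == "__main__":
    print("total examples:", sum(SPLIT_SIZES.values()))
    for name, share in split_shares(SPLIT_SIZES).items():
        print(f"{name}: {share}%")
```

With these counts the listed splits total 35,210 examples: roughly 71.7% train, 8.5% validation, and 19.8% test.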
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
As reported via email by Yejin Choi, the dataset is licensed under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
### Citation Information
```
@inproceedings{huang-etal-2019-cosmos,
title = "Cosmos {QA}: Machine Reading Comprehension with Contextual Commonsense Reasoning",
author = "Huang, Lifu and
Le Bras, Ronan and
Bhagavatula, Chandra and
Choi, Yejin",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1243",
doi = "10.18653/v1/D19-1243",
pages = "2391--2401",
}
```
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun), [@albertvillanova](https://github.com/albertvillanova), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Provider:
allenai
Summary of Original Information
Dataset Overview
Dataset Name
- Name: CosmosQA
- Alias: cosmos_qa
Basic Dataset Information
- Language: English (en)
- Language creators: found
- License: CC BY 4.0
- Multilinguality: monolingual
- Size category: 10K<n<100K
- Source datasets: original
- Task category: multiple-choice
- Task ID: multiple-choice-qa
Dataset Features
- id: string
- context: string
- question: string
- answer0: string
- answer1: string
- answer2: string
- answer3: string
- label: int32
Dataset Splits
- Train: 25,262 examples, 17,159,918 bytes
- Test: 6,963 examples, 5,121,479 bytes
- Validation: 2,985 examples, 2,186,987 bytes
- Download size: 24,399,475 bytes
- Dataset size: 24,468,384 bytes
Dataset Creation
- Annotation creators: crowdsourced
- Licensing information: as reported via email, the dataset is licensed under CC BY 4.0
- Citation information: Huang et al., 2019; the full BibTeX entry appears under "Citation Information" above.
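Collected Summary
Dataset Introduction

Construction
The Cosmos QA dataset is built around a close reading of everyday narratives: annotators reason over the context to answer multiple-choice questions about the likely causes or effects of events. The dataset was constructed through crowdsourcing, with a carefully designed annotation process intended to ensure the quality of the questions and answers.
Usage
When using the Cosmos QA dataset, users can choose the split that suits their needs (train, validation, or test). The dataset is stored in JSON format and contains a question ID, a context, a question, four candidate answers, and the correct-answer label. Users can load the data with the Hugging Face `datasets` library and use it to train and evaluate models on machine reading comprehension and commonsense reasoning tasks.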
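The loading step described above might look like the following sketch. It assumes the `datasets` package is installed and that `load_dataset("allenai/cosmos_qa", split=...)` works with the installed version; depending on the version, additional arguments may be required or the repository format may not be supported. The wrapper `load_cosmos_split` and its `loader` parameter are hypothetical conveniences so the logic can be exercised without network access.

```python
from typing import Any, Callable, Optional

DATASET_ID = "allenai/cosmos_qa"  # repository name as shown on this page
SPLITS = ("train", "validation", "test")  # split names listed in the card


def load_cosmos_split(split: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load one split of the dataset.

    `loader` is a hypothetical injection point for offline use; by default the
    function falls back to `datasets.load_dataset`.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    if loader is None:
        # Imported lazily so this module can be imported without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset
    return loader(DATASET_ID, split=split)


if __name__ == "__main__":
    # Requires network access and a compatible `datasets` installation (assumption).
    validation = load_cosmos_split("validation")
    print(validation[0]["question"])
```

Rejecting unknown split names up front keeps a typo such as `"dev"` from turning into a less obvious error inside the library call.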
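Background and Challenges
Background Overview
The Cosmos QA dataset was created by Lifu Huang et al. in 2019 to advance machine reading comprehension, with a particular focus on contextual commonsense reasoning. It contains 35.6K reading comprehension problems posed as multiple-choice questions. The questions are based on people's everyday narratives and require reasoning about the likely causes or effects of events rather than relying only on the literal text. The dataset has significant research value for improving machines' ability to understand complex situations and reason deeply, and it has had a notable impact on natural language processing.
Current Challenges
Challenges encountered in building the dataset include how to design questions that accurately reflect commonsense reasoning ability while covering diverse everyday situations, and how to collect high-quality data through crowdsourcing while keeping annotations accurate and consistent. In addition, possible biases and limitations in the dataset, and the need to ensure that its use does not raise social, ethical, or legal problems, are important considerations when using and extending it.
Common Scenarios
Classic Use Cases
In natural language processing, especially in machine reading comprehension research and applications, CosmosQA's distinctive commonsense reasoning question setting has made it an important benchmark for testing deeper model understanding. By providing narratives rich in contextual information and requiring models to infer the likely causes or effects of events, the dataset gives researchers a platform for demonstrating a model's reasoning ability.
Academic Problems Addressed
CosmosQA addresses the lack of reasoning and commonsense evaluation in traditional reading comprehension datasets. It requires models not only to understand the literal meaning of a text but also to reason beyond its exact text spans, which matters for advancing machine reading comprehension and for improving models' understanding of complex contexts.
Practical Applications
In practice, CosmosQA can be used to train and evaluate intelligent dialogue systems, recommender systems, and other AI applications that need to simulate human commonsense reasoning. The reasoning training it provides helps models better understand and respond to users' implicit needs, improving user experience and system usefulness.
Recent Research on the Dataset
Latest Research Directions
As a large-scale set of reading comprehension problems based on commonsense reasoning, Cosmos QA has recently been studied mainly with the aim of improving machines' ability to understand and reason in complex contexts. What distinguishes the dataset is that it requires models not only to understand the literal meaning of a text but also to infer the likely causes or effects behind events. Researchers are currently exploring how deep learning models perform on this kind of task, in particular how contextual commonsense reasoning can improve accuracy on multiple-choice question answering. Other work examines bias and fairness in the dataset, with the goal of consistent model behavior across groups and situations. Continued research on and refinement of Cosmos QA matters for advancing natural language processing, especially for deepening and broadening machine understanding of people's everyday narratives.
The content above was collected and summarized by 遇见数据集.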



