allenai/cosmos_qa

Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/allenai/cosmos_qa


Resource description:

---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: CosmosQA
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- multiple-choice
task_ids:
- multiple-choice-qa
paperswithcode_id: cosmosqa
dataset_info:
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer0
    dtype: string
  - name: answer1
    dtype: string
  - name: answer2
    dtype: string
  - name: answer3
    dtype: string
  - name: label
    dtype: int32
  splits:
  - name: train
    num_bytes: 17159918
    num_examples: 25262
  - name: test
    num_bytes: 5121479
    num_examples: 6963
  - name: validation
    num_bytes: 2186987
    num_examples: 2985
  download_size: 24399475
  dataset_size: 24468384
---

# Dataset Card for "cosmos_qa"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://wilburone.github.io/cosmos/](https://wilburone.github.io/cosmos/)
- **Repository:** https://github.com/wilburOne/cosmosqa/
- **Paper:** [Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning](https://arxiv.org/abs/1909.00277)
- **Point of Contact:** [Lifu Huang](mailto:warrior.fu@gmail.com)
- **Size of downloaded dataset files:** 24.40 MB
- **Size of the generated dataset:** 24.51 MB
- **Total amount of disk used:** 48.91 MB

### Dataset Summary

Cosmos QA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions about the likely causes or effects of events that require reasoning beyond the exact text spans in the context.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 24.40 MB
- **Size of the generated dataset:** 24.51 MB
- **Total amount of disk used:** 48.91 MB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "answer0": "If he gets married in the church he wo nt have to get a divorce .",
    "answer1": "He wants to get married to a different person .",
    "answer2": "He wants to know if he does nt like this girl can he divorce her ?",
    "answer3": "None of the above choices .",
    "context": "\"Do i need to go for a legal divorce ? I wanted to marry a woman but she is not in the same religion , so i am not concern of th...",
    "id": "3BFF0DJK8XA7YNK4QYIGCOG1A95STE##3180JW2OT5AF02OISBX66RFOCTG5J7##A2LTOS0AZ3B28A##Blog_56156##q1_a1##378G7J1SJNCDAAIN46FM2P7T6KZEW2",
    "label": 1,
    "question": "Why is this person asking about divorce ?"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `id`: a `string` feature.
- `context`: a `string` feature.
- `question`: a `string` feature.
- `answer0`: a `string` feature.
- `answer1`: a `string` feature.
- `answer2`: a `string` feature.
- `answer3`: a `string` feature.
- `label`: an `int32` feature.

### Data Splits

| name    | train | validation | test |
|---------|------:|-----------:|-----:|
| default | 25262 |       2985 | 6963 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

As reported via email by Yejin Choi, the dataset is licensed under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.

### Citation Information

```
@inproceedings{huang-etal-2019-cosmos,
    title = "Cosmos {QA}: Machine Reading Comprehension with Contextual Commonsense Reasoning",
    author = "Huang, Lifu and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1243",
    doi = "10.18653/v1/D19-1243",
    pages = "2391--2401",
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun), [@albertvillanova](https://github.com/albertvillanova), [@thomwolf](https://github.com/thomwolf) for adding this dataset.

Provider:

allenai

Original information summary

Dataset overview

Dataset name

Name: CosmosQA
Alias: cosmos_qa

Basic dataset information

Language: English (en)
Language creators: found
License: CC BY 4.0
Multilinguality: monolingual
Size category: 10K<n<100K
Source datasets: original
Task category: multiple-choice
Task ID: multiple-choice-qa

Dataset features

id: string
context: string
question: string
answer0: string
answer1: string
answer2: string
answer3: string
label: int32
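
The field list above can be written down as a typed record with a small validator. This is a minimal sketch, not part of the dataset's tooling: the names `CosmosQARecord`, `STRING_FIELDS`, and `validate_record` are hypothetical, and the only facts taken from the listing are the eight field names and their types.

```python
from typing import List, TypedDict


class CosmosQARecord(TypedDict):
    """One example, mirroring the eight fields listed above (hypothetical type name)."""

    id: str
    context: str
    question: str
    answer0: str
    answer1: str
    answer2: str
    answer3: str
    label: int  # stored as int32 in the dataset schema


STRING_FIELDS = ("id", "context", "question", "answer0", "answer1", "answer2", "answer3")


def validate_record(record: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name} should be a string")
    label = record.get("label")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(label, bool) or not isinstance(label, int):
        problems.append("label should be an integer")
    elif not -(2**31) <= label < 2**31:
        problems.append("label does not fit in int32")
    return problems
```

A check like this is mainly useful when records come from a custom export or preprocessing step rather than directly from the published files.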

Dataset splits

Training set: 25262 examples, 17159918 bytes
Test set: 6963 examples, 5121479 bytes
Validation set: 2985 examples, 2186987 bytes
Download size: 24399475 bytes
Dataset size: 24468384 bytes
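
As a quick sanity check on the figures above, the split sizes can be summed and turned into proportions. This is only arithmetic on the listed numbers; the variable names are illustrative.

```python
# Example and byte counts copied from the split listing above.
SPLIT_EXAMPLES = {"train": 25262, "validation": 2985, "test": 6963}
SPLIT_BYTES = {"train": 17159918, "validation": 2186987, "test": 5121479}

total_examples = sum(SPLIT_EXAMPLES.values())
total_bytes = sum(SPLIT_BYTES.values())

# Fraction of all examples that falls in each split.
shares = {name: count / total_examples for name, count in SPLIT_EXAMPLES.items()}

print(f"total examples: {total_examples}")
print(f"total bytes:    {total_bytes}")
for name, share in shares.items():
    print(f"{name:<10} {share:.1%}")
```

The per-split byte counts add up exactly to the listed dataset size of 24468384 bytes. The example counts add up to 35210, slightly below the 35.6K problems quoted in the dataset summary; the source does not explain the difference.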

Dataset creation

Annotation creators: crowdsourced
License information: as reported via email, the dataset is licensed under CC BY 4.0
Citation information:

@inproceedings{huang-etal-2019-cosmos,
    title = "Cosmos {QA}: Machine Reading Comprehension with Contextual Commonsense Reasoning",
    author = "Huang, Lifu and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1243",
    doi = "10.18653/v1/D19-1243",
    pages = "2391--2401",
}

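Collected summary

Dataset introduction

Construction

Cosmos QA is built on everyday narratives that must be read closely: annotators reason from the context to answer multiple-choice questions about the likely causes or effects of events. The dataset was crowdsourced, with an annotation process designed to keep the quality of questions and answers high.

Usage

When using Cosmos QA, choose the split that fits the task: training, validation, or test. Each example is a JSON-style record containing a question ID, a context, a question, four candidate answers, and the label of the correct answer. The data can be loaded with the Hugging Face datasets library and used to train and evaluate models on machine reading comprehension and commonsense reasoning tasks.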
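
The loading and record layout described above can be sketched as follows. This is a hedged example rather than an official recipe: `load_split`, `choices`, and `format_prompt` are hypothetical helper names, the repository id `allenai/cosmos_qa` is taken from the download link on this page and is assumed to resolve with your installed version of the `datasets` library, and the sample record is invented for illustration (it is not from the dataset).

```python
from typing import Dict, List

ANSWER_KEYS = ("answer0", "answer1", "answer2", "answer3")


def load_split(split: str = "validation"):
    """Load one split with the Hugging Face `datasets` library (needs network access).

    Assumption: the repository id "allenai/cosmos_qa" resolves in your
    environment; this function is defined here but not called.
    """
    from datasets import load_dataset  # lazy import so the rest runs offline

    return load_dataset("allenai/cosmos_qa", split=split)


def choices(record: Dict) -> List[str]:
    """Collect the four candidate answers in label order."""
    return [record[key] for key in ANSWER_KEYS]


def format_prompt(record: Dict) -> str:
    """Render a record as a plain-text multiple-choice prompt."""
    lines = [f"Context: {record['context']}", f"Question: {record['question']}"]
    for index, answer in enumerate(choices(record)):
        lines.append(f"({index}) {answer}")
    return "\n".join(lines)


# Hypothetical record with the same fields as the dataset; not real data.
sample = {
    "id": "toy-0",
    "context": "I left my umbrella at home and the sky turned dark on my walk to work .",
    "question": "What may happen next ?",
    "answer0": "I will get sunburned .",
    "answer1": "I may get wet from the rain .",
    "answer2": "I will buy a new house .",
    "answer3": "None of the above choices .",
    "label": 1,
}

print(format_prompt(sample))
print("Gold answer:", choices(sample)[sample["label"]])
```

Keeping the answers in `answer0` to `answer3` order means the integer `label` can index the list directly.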

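Background and challenges

Background overview

Cosmos QA was created by Lifu Huang et al. in 2019 to advance machine reading comprehension, with a particular focus on commonsense reasoning in context. It contains 35.6K reading comprehension problems posed as multiple-choice questions. The problems are based on people's everyday narratives and require reasoning about the likely causes or effects of events, beyond what the text states literally. The dataset has research value for improving machines' ability to understand complex situations and reason deeply, and it has had a notable influence on natural language processing.

Current challenges

Challenges in building the dataset included designing questions that accurately reflect commonsense reasoning ability while covering diverse everyday situations, and collecting high-quality data through crowdsourcing while keeping annotations accurate and consistent. Possible biases and limitations in the data, and the need to ensure that its use does not raise social, ethical, or legal problems, are further challenges to consider when using or extending the dataset.

Common scenarios

Classic use cases

In natural language processing, and particularly in machine reading comprehension research and applications, CosmosQA serves as a benchmark for a model's deeper understanding because of its commonsense reasoning problem setting. Each problem provides a narrative with rich context and asks the model to infer the likely causes or effects of events, giving researchers a way to measure a model's reasoning ability.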
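
Benchmark use of this kind usually comes down to multiple-choice accuracy. The sketch below shows one way to compute it, under stated assumptions: `accuracy`, `longest_answer_baseline`, and `make_record` are hypothetical names, the baseline is a deliberately trivial stand-in for a real model, records whose label is outside 0 to 3 are skipped on the assumption that they are unlabeled, and the toy records are invented for illustration.

```python
from typing import Callable, Dict, Iterable

NUM_CHOICES = 4


def accuracy(records: Iterable[Dict], predict: Callable[[Dict], int]) -> float:
    """Fraction of labeled records for which `predict` returns the gold label."""
    correct = 0
    total = 0
    for record in records:
        if record["label"] not in range(NUM_CHOICES):
            continue  # assumption: labels outside 0..3 mark unlabeled examples
        total += 1
        correct += int(predict(record) == record["label"])
    if total == 0:
        raise ValueError("no labeled records to evaluate")
    return correct / total


def longest_answer_baseline(record: Dict) -> int:
    """Trivial baseline: pick the longest candidate answer (first one wins ties)."""
    answers = [record[f"answer{i}"] for i in range(NUM_CHOICES)]
    return max(range(NUM_CHOICES), key=lambda i: len(answers[i]))


def make_record(answers, label):
    """Build a hypothetical toy record with the dataset's field names."""
    record = {"id": "toy", "context": "", "question": "", "label": label}
    record.update({f"answer{i}": text for i, text in enumerate(answers)})
    return record


toy_records = [
    make_record(["a", "bbbb", "cc", "d"], 1),                  # baseline is right
    make_record(["long answer here", "x", "y", "z"], 0),       # baseline is right
    make_record(["short", "the longest option", "mid", "m"], 2),  # baseline is wrong
    make_record(["p", "q", "r", "s"], -1),                     # unlabeled, skipped
]

score = accuracy(toy_records, longest_answer_baseline)
print(f"baseline accuracy on toy records: {score:.3f}")
```

Replacing `longest_answer_baseline` with a function that calls a trained model gives the same evaluation loop for real experiments.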

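Academic problems addressed

CosmosQA addresses the lack of reasoning and commonsense evaluation in traditional reading comprehension datasets. It requires a model not only to understand the literal meaning of the text but also to reason beyond the exact text spans, which matters for advancing machine reading comprehension and improving models' understanding of complex contexts.

Practical applications

In practice, CosmosQA can be used to train and evaluate intelligent dialogue systems, recommender systems, and any AI application that needs to approximate human commonsense reasoning. The reasoning it exercises helps models better understand and respond to users' implicit needs, improving user experience and the practical usefulness of the system.

Recent research on the dataset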