five

xstory_cloze

收藏
魔搭社区2025-11-27 更新2024-08-31 收录
下载链接:
https://modelscope.cn/datasets/opencompass/xstory_cloze
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for XStoryCloze ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** [https://cs.rochester.edu/nlp/rocstories/](https://cs.rochester.edu/nlp/rocstories/) - **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **Paper:** [Few-shot Learning with Multilingual Generative Language Models](https://arxiv.org/pdf/2112.10668.pdf) - **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **Size of downloaded dataset files:** 2.03 MB - **Size of the generated dataset:** 2.03 MB - **Total amount of disk used:** 2.05 MB ### Dataset Summary XStoryCloze consists of the professionally translated version of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) to 10 non-English languages. This dataset is released by Meta AI. ### Supported Tasks and Leaderboards commonsense reasoning ### Languages en, ru, zh (Simplified), es (Latin America), ar, hi, id, te, sw, eu, my. ## Dataset Structure ### Data Instances - **Size of downloaded dataset files:** 2.03 MB - **Size of the generated dataset:** 2.03 MB - **Total amount of disk used:** 2.05 MB An example of 'train' looks as follows. ``` {'answer_right_ending': 1, 'input_sentence_1': 'Rick grew up in a troubled household.', 'input_sentence_2': 'He never found good support in family, and turned to gangs.', 'input_sentence_3': "It wasn't long before Rick got shot in a robbery.", 'input_sentence_4': 'The incident caused him to turn a new leaf.', 'sentence_quiz1': 'He is happy now.', 'sentence_quiz2': 'He joined a gang.', 'story_id': '138d5bfb-05cc-41e3-bf2c-fa85ebad14e2'} ``` ### Data Fields The data fields are the same among all splits. - `input_sentence_1`: The first statement in the story. - `input_sentence_2`: The second statement in the story. - `input_sentence_3`: The third statement in the story. - `input_sentence_4`: The forth statement in the story. - `sentence_quiz1`: first possible continuation of the story. - `sentence_quiz2`: second possible continuation of the story. - `answer_right_ending`: correct possible ending; either 1 or 2. - `story_id`: story id. ### Data Splits This dataset is intended to be used for evaluating the zero- and few-shot learning capabilities of multlingual language models. We split the data for each language into train and test (360 vs. 1510 examples, respectively). The released data files for different languages maintain a line-by-line alignment. | name |train |test| |-------|-----:|---:| |en|360|1510| |ru|360|1510| |zh|360|1510| |es|360|1510| |ar|360|1510| |hi|360|1510| |id|360|1510| |te|360|1510| |sw|360|1510| |eu|360|1510| |my|360|1510| ## Dataset Creation ### Curation Rationale [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Source Data #### Initial Data Collection and Normalization [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### Who are the source language producers? [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Annotations #### Annotation process [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### Who are the annotators? [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Personal and Sensitive Information [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Discussion of Biases [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Other Known Limitations [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Additional Information ### Dataset Curators [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Licensing Information XStoryCloze is opensourced under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode), the same license as the original English StoryCloze. ### Citation Information ``` @article{DBLP:journals/corr/abs-2112-10668, author = {Xi Victoria Lin and Todor Mihaylov and Mikel Artetxe and Tianlu Wang and Shuohui Chen and Daniel Simig and Myle Ott and Naman Goyal and Shruti Bhosale and Jingfei Du and Ramakanth Pasunuru and Sam Shleifer and Punit Singh Koura and Vishrav Chaudhary and Brian O'Horo and Jeff Wang and Luke Zettlemoyer and Zornitsa Kozareva and Mona T. Diab and Veselin Stoyanov and Xian Li}, title = {Few-shot Learning with Multilingual Language Models}, journal = {CoRR}, volume = {abs/2112.10668}, year = {2021}, url = {https://arxiv.org/abs/2112.10668}, eprinttype = {arXiv}, eprint = {2112.10668}, timestamp = {Tue, 04 Jan 2022 15:59:27 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } ``` ### Contributions Thanks to [@juletx](https://github.com/juletx).

# XStoryCloze 数据集卡片 ## 目录 - [数据集描述](#dataset-description) - [数据集概述](#dataset-summary) - [支持任务与基准排行榜](#supported-tasks-and-leaderboards) - [支持语言](#languages) - [数据集结构](#dataset-structure) - [数据实例](#data-instances) - [数据字段](#data-fields) - [数据划分](#data-splits) - [数据集构建](#dataset-creation) - [构建初衷](#curation-rationale) - [源数据](#source-data) - [标注信息](#annotations) - [个人与敏感信息](#personal-and-sensitive-information) - [数据集使用注意事项](#considerations-for-using-the-data) - [数据集的社会影响](#social-impact-of-dataset) - [偏差讨论](#discussion-of-biases) - [其他已知局限性](#other-known-limitations) - [附加信息](#additional-information) - [数据集维护者](#dataset-curators) - [许可协议信息](#licensing-information) - [引用信息](#citation-information) - [贡献者](#contributions) ## 数据集描述 - **主页:** [https://cs.rochester.edu/nlp/rocstories/](https://cs.rochester.edu/nlp/rocstories/) - **代码仓库:** [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **相关论文:** [多语言生成语言模型的少样本学习](https://arxiv.org/pdf/2112.10668.pdf) - **联系方式:** [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **下载数据集文件大小:** 2.03 MB - **生成后数据集大小:** 2.03 MB - **总磁盘占用量:** 2.05 MB ### 数据集概述 XStoryCloze 由[英语StoryCloze数据集(English StoryCloze,2016春季版)]经专业翻译而来,共覆盖10种非英语语言。本数据集由Meta AI发布。 ### 支持任务与基准排行榜 常识推理 ### 支持语言 en(英语)、ru(俄语)、zh(简体中文)、es(拉美西班牙语)、ar(阿拉伯语)、hi(印地语)、id(印尼语)、te(泰卢固语)、sw(斯瓦希里语)、eu(巴斯克语)、my(缅甸语)。 ## 数据集结构 ### 数据实例 - **下载数据集文件大小:** 2.03 MB - **生成后数据集大小:** 2.03 MB - **总磁盘占用量:** 2.05 MB 训练集的一个示例如下: {'answer_right_ending': 1, 'input_sentence_1': 'Rick grew up in a troubled household.', 'input_sentence_2': 'He never found good support in family, and turned to gangs.', 'input_sentence_3': "It wasn't long before Rick got shot in a robbery.", 'input_sentence_4': 'The incident caused him to turn a new leaf.', 'sentence_quiz1': 'He is happy now.', 'sentence_quiz2': 'He joined a gang.', 'story_id': '138d5bfb-05cc-41e3-bf2c-fa85ebad14e2'} ### 数据字段 所有数据划分下的字段均保持一致: - `input_sentence_1`:故事的第一句陈述 - `input_sentence_2`:故事的第二句陈述 - `input_sentence_3`:故事的第三句陈述 - `input_sentence_4`:故事的第四句陈述 - `sentence_quiz1`:故事的第一种可能续句 - `sentence_quiz2`:故事的第二种可能续句 - `answer_right_ending`:正确的故事续句,取值为1或2 - `story_id`:故事唯一标识符 ### 数据划分 本数据集旨在用于评估多语言大语言模型(Large Language Model,LLM)的零样本(zero-shot)与少样本(few-shot)学习能力。我们将每种语言的数据集划分为训练集与测试集,样本量分别为360与1510。不同语言的发布数据文件保持逐行对齐。 | 数据集名称 | 训练集样本数 | 测试集样本数 | |-------|-----:|---:| |en|360|1510| |ru|360|1510| |zh|360|1510| |es|360|1510| |ar|360|1510| |hi|360|1510| |id|360|1510| |te|360|1510| |sw|360|1510| |eu|360|1510| |my|360|1510| ## 数据集构建 ### 构建初衷 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 源数据 #### 初始数据收集与标准化 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### 源语言数据的生产者是谁? [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 标注信息 #### 标注流程 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### 标注人员是谁? [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 个人与敏感信息 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## 数据集使用注意事项 ### 数据集的社会影响 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 偏差讨论 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 其他已知局限性 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## 附加信息 ### 数据集维护者 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 许可协议信息 XStoryCloze 采用[CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)开源协议,与原始英语StoryCloze数据集保持一致。 ### 引用信息 @article{DBLP:journals/corr/abs-2112-10668, author = {Xi Victoria Lin and Todor Mihaylov and Mikel Artetxe and Tianlu Wang and Shuohui Chen and Daniel Simig and Myle Ott and Naman Goyal and Shruti Bhosale and Jingfei Du and Ramakanth Pasunuru and Sam Shleifer and Punit Singh Koura and Vishrav Chaudhary and Brian O'Horo and Jeff Wang and Luke Zettlemoyer and Zornitsa Kozareva and Mona T. Diab and Veselin Stoyanov and Xian Li}, title = {Few-shot Learning with Multilingual Language Models}, journal = {CoRR}, volume = {abs/2112.10668}, year = {2021}, url = {https://arxiv.org/abs/2112.10668}, eprinttype = {arXiv}, eprint = {2112.10668}, timestamp = {Tue, 04 Jan 2022 15:59:27 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2112.10668.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } ### 贡献者 感谢 [@juletx](https://github.com/juletx)。
提供机构:
maas
创建时间:
2024-07-02
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作