break_data

Name: break_data
Creator: maas
Published: 2025-11-27 16:35:29
License: No description available

ModelScope community; updated 2025-11-27; indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/allenai/break_data


Resource description:

# Dataset Card for "break_data"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/allenai/Break](https://github.com/allenai/Break)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 79.86 MB
- **Size of the generated dataset:** 155.55 MB
- **Total amount of disk used:** 235.39 MB

### Dataset Summary

Break is a human annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases. This repository contains the Break dataset along with information on the exact data format.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### QDMR

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 15.93 MB
- **Total amount of disk used:** 31.90 MB

An example of 'validation' looks as follows.

```
{
    "decomposition": "return flights ;return #1 from denver ;return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev"
}
```

#### QDMR-high-level

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 6.54 MB
- **Total amount of disk used:** 22.51 MB

An example of 'train' looks as follows.

```
{
    "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
```

#### QDMR-high-level-lexicon

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 31.64 MB
- **Total amount of disk used:** 47.61 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
    "source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}
```

#### QDMR-lexicon

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 77.19 MB
- **Total amount of disk used:** 93.16 MB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
    "source": "what flights are available tomorrow from denver to philadelphia "
}
```

#### logical-forms

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 24.25 MB
- **Total amount of disk used:** 40.22 MB

An example of 'train' looks as follows.

```
{
    "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "program": "some program",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
```

### Data Fields

The data fields are the same among all splits.

#### QDMR

- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.

#### QDMR-high-level

- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.

#### QDMR-high-level-lexicon

- `source`: a `string` feature.
- `allowed_tokens`: a `string` feature.

#### QDMR-lexicon

- `source`: a `string` feature.
- `allowed_tokens`: a `string` feature.

#### logical-forms

- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.
- `program`: a `string` feature.

### Data Splits

| name                    | train | validation | test |
|-------------------------|------:|-----------:|-----:|
| QDMR                    | 44321 |       7760 | 8069 |
| QDMR-high-level         | 17503 |       3130 | 3195 |
| QDMR-high-level-lexicon | 17503 |       3130 | 3195 |
| QDMR-lexicon            | 44321 |       7760 | 8069 |
| logical-forms           | 44098 |       7719 | 8006 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
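
The card stores both `decomposition` and `operators` as plain strings, so a consumer will usually need to unpack them before use. The following is a minimal sketch of one way to do that, not the dataset authors' tooling. It assumes that steps are separated by `;`, that `operators` holds a Python-style list literal, and that `#N` refers to the N-th step, all inferred from the QDMR example record in the card. The helper name `parse_qdmr` and the output dictionary keys are hypothetical.

```python
import ast
import re

# Example record copied from the QDMR 'validation' instance in the card.
record = {
    "decomposition": "return flights ;return #1 from denver ;"
                     "return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev",
}

# Matches step references such as "#1" (assumed to point at earlier steps).
REF_PATTERN = re.compile(r"#(\d+)")


def parse_qdmr(rec):
    """Hypothetical helper: turn one record into a list of step dicts."""
    # Assumption: ';' only ever appears as the step separator.
    steps = [part.strip() for part in rec["decomposition"].split(";")]
    # The operators field is a string holding a list literal, so it is
    # evaluated safely with ast.literal_eval rather than eval.
    operators = ast.literal_eval(rec["operators"])
    if len(steps) != len(operators):
        raise ValueError(
            f"{rec['question_id']}: {len(steps)} steps "
            f"but {len(operators)} operators"
        )
    parsed = []
    for index, (text, operator) in enumerate(zip(steps, operators), start=1):
        parsed.append({
            "index": index,
            "operator": operator,
            "text": text,
            "refs": [int(n) for n in REF_PATTERN.findall(text)],
        })
    return parsed


if __name__ == "__main__":
    for step in parse_qdmr(record):
        print(step["index"], step["operator"], step["refs"], step["text"])
```

Records whose `decomposition` or `operators` field is empty would need separate handling, since `ast.literal_eval` raises on an empty string; the sketch only covers populated records like the one shown.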


Provider:

maas

Created:

2025-05-28
