allenai/break_data
Hugging Face · updated 2024-01-11 · indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/allenai/break_data
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text2text-generation
task_ids:
- open-domain-abstractive-qa
paperswithcode_id: break
pretty_name: BREAK
dataset_info:
- config_name: QDMR
  features:
  - name: question_id
    dtype: string
  - name: question_text
    dtype: string
  - name: decomposition
    dtype: string
  - name: operators
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 12757200
    num_examples: 44321
  - name: validation
    num_bytes: 2231632
    num_examples: 7760
  - name: test
    num_bytes: 894558
    num_examples: 8069
  download_size: 5175508
  dataset_size: 15883390
- config_name: QDMR-high-level
  features:
  - name: question_id
    dtype: string
  - name: question_text
    dtype: string
  - name: decomposition
    dtype: string
  - name: operators
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 5134938
    num_examples: 17503
  - name: validation
    num_bytes: 912408
    num_examples: 3130
  - name: test
    num_bytes: 479919
    num_examples: 3195
  download_size: 3113187
  dataset_size: 6527265
- config_name: QDMR-high-level-lexicon
  features:
  - name: source
    dtype: string
  - name: allowed_tokens
    dtype: string
  splits:
  - name: train
    num_bytes: 23227946
    num_examples: 17503
  - name: validation
    num_bytes: 4157495
    num_examples: 3130
  - name: test
    num_bytes: 4239547
    num_examples: 3195
  download_size: 5663924
  dataset_size: 31624988
- config_name: QDMR-lexicon
  features:
  - name: source
    dtype: string
  - name: allowed_tokens
    dtype: string
  splits:
  - name: train
    num_bytes: 56896433
    num_examples: 44321
  - name: validation
    num_bytes: 9934015
    num_examples: 7760
  - name: test
    num_bytes: 10328787
    num_examples: 8069
  download_size: 10818266
  dataset_size: 77159235
- config_name: logical-forms
  features:
  - name: question_id
    dtype: string
  - name: question_text
    dtype: string
  - name: decomposition
    dtype: string
  - name: operators
    dtype: string
  - name: split
    dtype: string
  - name: program
    dtype: string
  splits:
  - name: train
    num_bytes: 19783061
    num_examples: 44098
  - name: validation
    num_bytes: 3498114
    num_examples: 7719
  - name: test
    num_bytes: 920007
    num_examples: 8006
  download_size: 7572815
  dataset_size: 24201182
configs:
- config_name: QDMR
  data_files:
  - split: train
    path: QDMR/train-*
  - split: validation
    path: QDMR/validation-*
  - split: test
    path: QDMR/test-*
- config_name: QDMR-high-level
  data_files:
  - split: train
    path: QDMR-high-level/train-*
  - split: validation
    path: QDMR-high-level/validation-*
  - split: test
    path: QDMR-high-level/test-*
- config_name: QDMR-high-level-lexicon
  data_files:
  - split: train
    path: QDMR-high-level-lexicon/train-*
  - split: validation
    path: QDMR-high-level-lexicon/validation-*
  - split: test
    path: QDMR-high-level-lexicon/test-*
- config_name: QDMR-lexicon
  data_files:
  - split: train
    path: QDMR-lexicon/train-*
  - split: validation
    path: QDMR-lexicon/validation-*
  - split: test
    path: QDMR-lexicon/test-*
- config_name: logical-forms
  data_files:
  - split: train
    path: logical-forms/train-*
  - split: validation
    path: logical-forms/validation-*
  - split: test
    path: logical-forms/test-*
---
# Dataset Card for "break_data"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/allenai/Break](https://github.com/allenai/Break)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 79.86 MB
- **Size of the generated dataset:** 155.55 MB
- **Total amount of disk used:** 235.39 MB
### Dataset Summary
Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations
(QDMRs). Break consists of 83,978 examples sampled from 10 question-answering datasets over text, images, and databases.
This repository contains the Break dataset along with information on the exact data format.
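The 83,978 figure can be cross-checked against the split sizes listed in this card. The sketch below assumes the figure counts the `QDMR` and `QDMR-high-level` configurations together; the card does not state this explicitly, and the helper name `total_examples` is only illustrative.

```python
# Split sizes copied from this card's metadata: (train, validation, test).
SPLITS = {
    "QDMR": (44321, 7760, 8069),
    "QDMR-high-level": (17503, 3130, 3195),
}


def total_examples(splits):
    """Sum the example counts over every configuration and split."""
    return sum(sum(counts) for counts in splits.values())


qdmr_total = sum(SPLITS["QDMR"])
high_level_total = sum(SPLITS["QDMR-high-level"])

# Assumption: the headline count covers exactly these two configurations.
print(qdmr_total, high_level_total, total_examples(SPLITS))
# prints: 60150 23828 83978
```

Under that assumption the two configurations add up to the headline count exactly, which suggests the lexicon and logical-forms configurations are alternative views of the same questions rather than additional examples.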
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### QDMR
- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 15.93 MB
- **Total amount of disk used:** 31.90 MB
An example of 'validation' looks as follows.
```
{
    "decomposition": "return flights ;return #1 from denver ;return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev"
}
```
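Both `decomposition` and `operators` are stored as plain strings, so a consumer has to split them before use. The following is a minimal sketch based only on the example above: it assumes steps are separated by `;`, that each step starts with `return`, that `#N` refers to the N-th step (1-based), and that `operators` is the string form of a Python list. The function name `parse_qdmr` is hypothetical and not part of any Break tooling.

```python
import ast
import re


def parse_qdmr(decomposition, operators):
    """Split a QDMR record into steps with their operator and step references."""
    # Assumption: steps are ';'-separated and each begins with 'return'.
    texts = [part.strip() for part in decomposition.split(";") if part.strip()]
    # Assumption: 'operators' is the repr of a Python list of strings.
    ops = ast.literal_eval(operators)
    steps = []
    for index, (text, op) in enumerate(zip(texts, ops), start=1):
        body = re.sub(r"^return\s+", "", text)
        # '#N' is read as a reference to the result of step N.
        refs = [int(number) for number in re.findall(r"#(\d+)", body)]
        steps.append({"index": index, "operator": op, "text": body, "refs": refs})
    return steps


example = {
    "decomposition": "return flights ;return #1 from denver ;"
                     "return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
}

steps = parse_qdmr(example["decomposition"], example["operators"])
for step in steps:
    print(step["index"], step["operator"], step["text"], step["refs"])
```

For this record the result is a chain: step 1 selects `flights`, and each later `filter` step refers to the step immediately before it.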
#### QDMR-high-level
- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 6.54 MB
- **Total amount of disk used:** 22.51 MB
An example of 'train' looks as follows.
```
{
    "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
```
#### QDMR-high-level-lexicon
- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 31.64 MB
- **Total amount of disk used:** 47.61 MB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
    "source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}
```
#### QDMR-lexicon
- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 77.19 MB
- **Total amount of disk used:** 93.16 MB
An example of 'validation' looks as follows.
```
This example was too long and was cropped:
{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
    "source": "what flights are available tomorrow from denver to philadelphia "
}
```
#### logical-forms
- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 24.25 MB
- **Total amount of disk used:** 40.22 MB
An example of 'train' looks as follows.
```
{
    "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "program": "some program",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
```
### Data Fields
Within each configuration, the data fields are the same across all splits.
#### QDMR
- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.
#### QDMR-high-level
- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.
#### QDMR-high-level-lexicon
- `source`: a `string` feature.
- `allowed_tokens`: a `string` feature.
#### QDMR-lexicon
- `source`: a `string` feature.
- `allowed_tokens`: a `string` feature.
#### logical-forms
- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.
- `program`: a `string` feature.
### Data Splits
| name |train|validation|test|
|-----------------------|----:|---------:|---:|
|QDMR |44321| 7760|8069|
|QDMR-high-level |17503| 3130|3195|
|QDMR-high-level-lexicon|17503| 3130|3195|
|QDMR-lexicon |44321| 7760|8069|
|logical-forms |44098| 7719|8006|
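The table shows that `logical-forms` has slightly fewer rows than `QDMR` in every split. The card does not explain the difference, so the sketch below only quantifies it from the numbers above; `split_gap` is an illustrative helper name.

```python
SPLIT_NAMES = ("train", "validation", "test")

# Counts copied from the Data Splits table.
COUNTS = {
    "QDMR": (44321, 7760, 8069),
    "logical-forms": (44098, 7719, 8006),
}


def split_gap(larger, smaller):
    """Return the per-split difference in example counts."""
    return {name: a - b for name, a, b in zip(SPLIT_NAMES, larger, smaller)}


gap = split_gap(COUNTS["QDMR"], COUNTS["logical-forms"])
print(gap, sum(gap.values()))
# prints: {'train': 223, 'validation': 41, 'test': 63} 327
```

Code that joins the two configurations on `question_id` should therefore expect 327 `QDMR` rows without a `logical-forms` counterpart, assuming the IDs otherwise match.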
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@article{Wolfson2020Break,
title={Break It Down: A Question Understanding Benchmark},
author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
journal={Transactions of the Association for Computational Linguistics},
year={2020},
}
```
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
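Provider:
allenai
Summary of original information
Dataset overview
Basic information
- Dataset name: BREAK
- Language: English (en)
- License: unknown
- Multilinguality: monolingual
- Size: 10K<n<100K
- Source data: original
- Task category: text-to-text generation (text2text-generation)
- Task ID: open-domain abstractive question answering (open-domain-abstractive-qa)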
Dataset structure
Configuration: QDMR
- Features:
  - question_id: string
  - question_text: string
  - decomposition: string
  - operators: string
  - split: string
- Splits:
  - train: 44321 examples, 12757200 bytes
  - validation: 7760 examples, 2231632 bytes
  - test: 8069 examples, 894558 bytes
- Download size: 5175508 bytes
- Dataset size: 15883390 bytes
Configuration: QDMR-high-level
- Features:
  - question_id: string
  - question_text: string
  - decomposition: string
  - operators: string
  - split: string
- Splits:
  - train: 17503 examples, 5134938 bytes
  - validation: 3130 examples, 912408 bytes
  - test: 3195 examples, 479919 bytes
- Download size: 3113187 bytes
- Dataset size: 6527265 bytes
Configuration: QDMR-high-level-lexicon
- Features:
  - source: string
  - allowed_tokens: string
- Splits:
  - train: 17503 examples, 23227946 bytes
  - validation: 3130 examples, 4157495 bytes
  - test: 3195 examples, 4239547 bytes
- Download size: 5663924 bytes
- Dataset size: 31624988 bytes
Configuration: QDMR-lexicon
- Features:
  - source: string
  - allowed_tokens: string
- Splits:
  - train: 44321 examples, 56896433 bytes
  - validation: 7760 examples, 9934015 bytes
  - test: 8069 examples, 10328787 bytes
- Download size: 10818266 bytes
- Dataset size: 77159235 bytes
Configuration: logical-forms
- Features:
  - question_id: string
  - question_text: string
  - decomposition: string
  - operators: string
  - split: string
  - program: string
- Splits:
  - train: 44098 examples, 19783061 bytes
  - validation: 7719 examples, 3498114 bytes
  - test: 8006 examples, 920007 bytes
- Download size: 7572815 bytes
- Dataset size: 24201182 bytes
Data splits
| Name | Train | Validation | Test |
|---|---|---|---|
| QDMR | 44321 | 7760 | 8069 |
| QDMR-high-level | 17503 | 3130 | 3195 |
| QDMR-high-level-lexicon | 17503 | 3130 | 3195 |
| QDMR-lexicon | 44321 | 7760 | 8069 |
| logical-forms | 44098 | 7719 | 8006 |
Collected summary
Dataset introduction

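Construction
Question understanding and semantic parsing have long been active and difficult research topics in natural language processing. Break was built as a resource for this problem: it consists of human annotations for 83,978 natural language questions drawn from 10 question-answering datasets over text, images, and databases. Its core contribution is the Question Decomposition Meaning Representation (QDMR). Crowdsourced annotators break each complex question down, step by step, into a sequence of atomic sub-questions, and the semantic operator used at each step (such as select, filter, or project) is recorded. The result, for each question, is a clearly structured and logically coherent sequence of steps from which models can learn the semantic structure inside a question.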
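Features
The most notable features of Break are its multi-level representations and its broad coverage. It provides QDMR decompositions at more than one granularity, including standard QDMR, high-level QDMR, and the corresponding lexicon configurations, which serve semantic parsing tasks at different levels of detail. It also includes a logical-forms configuration that pairs each decomposition with a program (the `program` field), providing a bridge toward end-to-end semantic parsing research. The dataset is moderate in size: the QDMR configuration has 44,321 training examples, 7,760 validation examples, and 8,069 test examples. Because the examples span multiple domains and modalities, models can learn question decomposition that generalizes better.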
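Usage
When training and evaluating models on Break, choose the configuration that matches the task. For basic question decomposition, load the QDMR configuration and train a sequence-to-sequence model with question_text as the input and decomposition as the target sequence. To add lexical constraints, combine it with the corresponding lexicon configuration. The high-level and logical-forms configurations suit more complex semantic parsing scenarios. The dataset is already divided into train, validation, and test splits. A given configuration and split can be loaded with the Hugging Face Datasets library, and model output can be scored with standard text generation metrics (such as BLEU or ROUGE).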
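The loading and input/target mapping described above might look like the following sketch. The repository ID `allenai/break_data` and the configuration and field names come from this page; the helper names `to_seq2seq_pair` and `load_pairs` are illustrative, and stripping surrounding whitespace is an assumption about preprocessing rather than a documented requirement.

```python
def to_seq2seq_pair(example):
    """Map one QDMR record to an input/target pair of stripped strings."""
    return {
        "input": example["question_text"].strip(),
        "target": example["decomposition"].strip(),
    }


def load_pairs(config="QDMR", split="validation"):
    """Load one configuration and split, then add 'input' and 'target' columns."""
    # Imported lazily so to_seq2seq_pair can be used without the library installed.
    from datasets import load_dataset

    dataset = load_dataset("allenai/break_data", config, split=split)
    return dataset.map(to_seq2seq_pair)


# Offline check on a record copied from the QDMR example in this card.
sample = {
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "decomposition": "return flights ;return #1 from denver ;"
                     "return #2 to philadelphia ;return #3 if available",
}
pair = to_seq2seq_pair(sample)
print(pair["input"])
print(pair["target"])
```

Calling `load_pairs()` downloads the data, so it is left uninvoked here; the returned dataset keeps the original columns alongside the two new ones.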
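Background and challenges
Background
Understanding and reasoning over complex questions is a core challenge in natural language processing. To bridge the gap between natural language and structured queries, Tomer Wolfson, Mor Geva, Ankit Gupta, and co-authors introduced the BREAK dataset in 2020; it is distributed by the Allen Institute for AI (allenai). The dataset builds Question Decomposition Meaning Representations (QDMR), which break a natural language question down step by step into interpretable semantic sub-steps and so provide a structured intermediate representation for open-domain abstractive question answering and text generation tasks. BREAK samples 83,978 examples from 10 question-answering datasets over text, images, and databases, annotated through crowdsourcing, and is organized into configurations covering QDMR, high-level decompositions, and logical forms. Its central research question is how to get models to decompose a complex question automatically into an ordered sequence of operations, improving the interpretability and generalization of cross-domain reasoning. The dataset has served as a basis for later work on semantic parsing, multi-hop reasoning, and explainable AI, and has supported a shift in question answering from end-to-end black boxes toward modular understanding.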
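Current challenges
The problem BREAK addresses is the structured understanding and decomposition of complex questions. The challenges in this area are: (1) natural language questions often contain implicit semantic relations, multi-step reasoning, and cross-modal information, and mapping an ambiguous expression precisely onto a discrete, ordered sequence of operations (such as select, filter, and project) is the core difficulty; (2) existing models tend to lose meaning when handling long-range dependencies and nested sub-questions, so the decomposition drifts from the question's actual intent. Building the dataset poses challenges as well: (1) questions from 10 different source datasets vary widely in domain, phrasing, and granularity, so a unified annotation guideline is needed to keep decompositions consistent; (2) crowdsourced annotators interpret complex questions subjectively, and ensuring accurate, reproducible annotations through quality control and conflict resolution becomes harder as the dataset grows.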
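Common use cases
Typical use case
The most typical use of BREAK is as a benchmark resource for question decomposition and semantic parsing. By pairing a large number of natural language questions with their Question Decomposition Meaning Representations (QDMR), the dataset offers a standardized platform for training and evaluating models that learn to break a complex question into an ordered series of sub-questions. Researchers usually take the QDMR configuration and train a sequence-to-sequence model to generate the decomposition automatically, improving the model's semantic understanding of multi-step reasoning tasks.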
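Academic problems addressed
BREAK's main academic contribution is to address the long-standing lack of interpretable intermediate steps in semantic parsing of complex questions. Traditional question answering systems often map a question directly to an answer, which makes the reasoning process hard to capture. By annotating the decomposition logic of each question explicitly, the dataset lets models learn a step-by-step reasoning pattern. This has supported research on explainable question answering, generalization in semantic parsing, and cross-task transfer learning, and it provides a data foundation for systems with transparent reasoning.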
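Derived work
BREAK has given rise to a series of follow-up works, including semantic parsers improved with its QDMR representation, such as end-to-end models that combine decomposition steps with logical form generation. Later research also explored cross-domain transfer learning with the dataset, testing whether question decomposition representations generalize to unseen tasks. In addition, some work combined BREAK with knowledge graph question answering, using decomposition strategies to make multi-hop reasoning more robust, and this led to architectures such as decomposition-aware graph neural networks.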
The content above was collected and summarized by 遇见数据集.



