
allenai/break_data

Hugging Face, updated 2024-01-11, indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/allenai/break_data
Resource summary:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text2text-generation
task_ids:
- open-domain-abstractive-qa
paperswithcode_id: break
pretty_name: BREAK
dataset_info:
- config_name: QDMR
  features:
  - name: question_id
    dtype: string
  - name: question_text
    dtype: string
  - name: decomposition
    dtype: string
  - name: operators
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 12757200
    num_examples: 44321
  - name: validation
    num_bytes: 2231632
    num_examples: 7760
  - name: test
    num_bytes: 894558
    num_examples: 8069
  download_size: 5175508
  dataset_size: 15883390
- config_name: QDMR-high-level
  features:
  - name: question_id
    dtype: string
  - name: question_text
    dtype: string
  - name: decomposition
    dtype: string
  - name: operators
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 5134938
    num_examples: 17503
  - name: validation
    num_bytes: 912408
    num_examples: 3130
  - name: test
    num_bytes: 479919
    num_examples: 3195
  download_size: 3113187
  dataset_size: 6527265
- config_name: QDMR-high-level-lexicon
  features:
  - name: source
    dtype: string
  - name: allowed_tokens
    dtype: string
  splits:
  - name: train
    num_bytes: 23227946
    num_examples: 17503
  - name: validation
    num_bytes: 4157495
    num_examples: 3130
  - name: test
    num_bytes: 4239547
    num_examples: 3195
  download_size: 5663924
  dataset_size: 31624988
- config_name: QDMR-lexicon
  features:
  - name: source
    dtype: string
  - name: allowed_tokens
    dtype: string
  splits:
  - name: train
    num_bytes: 56896433
    num_examples: 44321
  - name: validation
    num_bytes: 9934015
    num_examples: 7760
  - name: test
    num_bytes: 10328787
    num_examples: 8069
  download_size: 10818266
  dataset_size: 77159235
- config_name: logical-forms
  features:
  - name: question_id
    dtype: string
  - name: question_text
    dtype: string
  - name: decomposition
    dtype: string
  - name: operators
    dtype: string
  - name: split
    dtype: string
  - name: program
    dtype: string
  splits:
  - name: train
    num_bytes: 19783061
    num_examples: 44098
  - name: validation
    num_bytes: 3498114
    num_examples: 7719
  - name: test
    num_bytes: 920007
    num_examples: 8006
  download_size: 7572815
  dataset_size: 24201182
configs:
- config_name: QDMR
  data_files:
  - split: train
    path: QDMR/train-*
  - split: validation
    path: QDMR/validation-*
  - split: test
    path: QDMR/test-*
- config_name: QDMR-high-level
  data_files:
  - split: train
    path: QDMR-high-level/train-*
  - split: validation
    path: QDMR-high-level/validation-*
  - split: test
    path: QDMR-high-level/test-*
- config_name: QDMR-high-level-lexicon
  data_files:
  - split: train
    path: QDMR-high-level-lexicon/train-*
  - split: validation
    path: QDMR-high-level-lexicon/validation-*
  - split: test
    path: QDMR-high-level-lexicon/test-*
- config_name: QDMR-lexicon
  data_files:
  - split: train
    path: QDMR-lexicon/train-*
  - split: validation
    path: QDMR-lexicon/validation-*
  - split: test
    path: QDMR-lexicon/test-*
- config_name: logical-forms
  data_files:
  - split: train
    path: logical-forms/train-*
  - split: validation
    path: logical-forms/validation-*
  - split: test
    path: logical-forms/test-*
---

# Dataset Card for "break_data"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/allenai/Break](https://github.com/allenai/Break)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 79.86 MB
- **Size of the generated dataset:** 155.55 MB
- **Total amount of disk used:** 235.39 MB

### Dataset Summary

Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images, and databases. This repository contains the Break dataset along with information on the exact data format.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### QDMR

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 15.93 MB
- **Total amount of disk used:** 31.90 MB

An example of 'validation' looks as follows.
```
{
    "decomposition": "return flights ;return #1 from denver ;return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev"
}
```

#### QDMR-high-level

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 6.54 MB
- **Total amount of disk used:** 22.51 MB

An example of 'train' looks as follows.
```
{
    "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
```

#### QDMR-high-level-lexicon

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 31.64 MB
- **Total amount of disk used:** 47.61 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
    "source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}
```

#### QDMR-lexicon

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 77.19 MB
- **Total amount of disk used:** 93.16 MB

An example of 'validation' looks as follows.
```
This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
    "source": "what flights are available tomorrow from denver to philadelphia "
}
```

#### logical-forms

- **Size of downloaded dataset files:** 15.97 MB
- **Size of the generated dataset:** 24.25 MB
- **Total amount of disk used:** 40.22 MB

An example of 'train' looks as follows.
```
{
    "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "program": "some program",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
```

### Data Fields

The data fields are the same among all splits.

#### QDMR
- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.

#### QDMR-high-level
- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.

#### QDMR-high-level-lexicon
- `source`: a `string` feature.
- `allowed_tokens`: a `string` feature.

#### QDMR-lexicon
- `source`: a `string` feature.
- `allowed_tokens`: a `string` feature.

#### logical-forms
- `question_id`: a `string` feature.
- `question_text`: a `string` feature.
- `decomposition`: a `string` feature.
- `operators`: a `string` feature.
- `split`: a `string` feature.
- `program`: a `string` feature.

### Data Splits

| name                  |train|validation|test|
|-----------------------|----:|---------:|---:|
|QDMR                   |44321|      7760|8069|
|QDMR-high-level        |17503|      3130|3195|
|QDMR-high-level-lexicon|17503|      3130|3195|
|QDMR-lexicon           |44321|      7760|8069|
|logical-forms          |44098|      7719|8006|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
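The Data Instances examples in the card store both `decomposition` and `operators` as plain strings. The following is a minimal sketch of how such a record might be unpacked into per-step pairs. It assumes that steps are separated by `;` and that `operators` holds a Python-style list literal, which is what the examples suggest but the card does not state outright. The helper name `parse_qdmr_record` is hypothetical.

```python
import ast


def parse_qdmr_record(record):
    """Return a list of (operator, step_text) pairs for one QDMR record.

    Assumption: steps are ';'-separated and `operators` is a string
    containing a Python-style list literal, as in the card's examples.
    """
    # Split the decomposition into steps and drop surrounding whitespace.
    steps = [s.strip() for s in record["decomposition"].split(";") if s.strip()]
    # Safely evaluate the list literal without executing arbitrary code.
    operators = ast.literal_eval(record["operators"])
    if len(steps) != len(operators):
        raise ValueError("number of steps and operators differ")
    return list(zip(operators, steps))


# The 'validation' example quoted in the card for the QDMR configuration.
example = {
    "decomposition": "return flights ;return #1 from denver ;"
                     "return #2 to philadelphia ;return #3 if available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev",
}

pairs = parse_qdmr_record(example)
for operator, step in pairs:
    print(f"{operator:>8}: {step}")
```

Using `ast.literal_eval` rather than `eval` keeps the parsing safe if a record ever contains unexpected text.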
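Provider:
allenai
Summary of original information

Dataset overview

Basic information

  • Dataset name: BREAK
  • Language: English (en)
  • License: unknown
  • Multilinguality: monolingual
  • Size: 10K<n<100K
  • Source data: original
  • Task category: text-to-text generation (text2text-generation)
  • Task ID: open-domain abstractive QA (open-domain-abstractive-qa)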

Dataset structure

Configuration name: QDMR
  • Features:
    • question_id: string
    • question_text: string
    • decomposition: string
    • operators: string
    • split: string
  • Splits:
    • train: 44321 examples, 12757200 bytes
    • validation: 7760 examples, 2231632 bytes
    • test: 8069 examples, 894558 bytes
  • Download size: 5175508 bytes
  • Dataset size: 15883390 bytes
Configuration name: QDMR-high-level
  • Features:
    • question_id: string
    • question_text: string
    • decomposition: string
    • operators: string
    • split: string
  • Splits:
    • train: 17503 examples, 5134938 bytes
    • validation: 3130 examples, 912408 bytes
    • test: 3195 examples, 479919 bytes
  • Download size: 3113187 bytes
  • Dataset size: 6527265 bytes
Configuration name: QDMR-high-level-lexicon
  • Features:
    • source: string
    • allowed_tokens: string
  • Splits:
    • train: 17503 examples, 23227946 bytes
    • validation: 3130 examples, 4157495 bytes
    • test: 3195 examples, 4239547 bytes
  • Download size: 5663924 bytes
  • Dataset size: 31624988 bytes
Configuration name: QDMR-lexicon
  • Features:
    • source: string
    • allowed_tokens: string
  • Splits:
    • train: 44321 examples, 56896433 bytes
    • validation: 7760 examples, 9934015 bytes
    • test: 8069 examples, 10328787 bytes
  • Download size: 10818266 bytes
  • Dataset size: 77159235 bytes
Configuration name: logical-forms
  • Features:
    • question_id: string
    • question_text: string
    • decomposition: string
    • operators: string
    • split: string
    • program: string
  • Splits:
    • train: 44098 examples, 19783061 bytes
    • validation: 7719 examples, 3498114 bytes
    • test: 8006 examples, 920007 bytes
  • Download size: 7572815 bytes
  • Dataset size: 24201182 bytes

Data splits

Name                     Train  Validation  Test
QDMR                     44321        7760  8069
QDMR-high-level          17503        3130  3195
QDMR-high-level-lexicon  17503        3130  3195
QDMR-lexicon             44321        7760  8069
logical-forms            44098        7719  8006
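As a quick arithmetic check, the split counts above can be summed per configuration. The sketch below uses only the numbers listed in the table. The two lexicon configurations are left out because their counts mirror the QDMR and QDMR-high-level configurations. Summing the QDMR and QDMR-high-level totals reproduces the 83,978 examples quoted in the dataset summary.

```python
# Split sizes copied from the data splits table.
splits = {
    "QDMR": {"train": 44321, "validation": 7760, "test": 8069},
    "QDMR-high-level": {"train": 17503, "validation": 3130, "test": 3195},
    "logical-forms": {"train": 44098, "validation": 7719, "test": 8006},
}

# Total number of examples per configuration.
totals = {name: sum(counts.values()) for name, counts in splits.items()}

# Low-level plus high-level decompositions together.
combined = totals["QDMR"] + totals["QDMR-high-level"]

for name, total in totals.items():
    print(f"{name}: {total}")
print(f"QDMR + QDMR-high-level: {combined}")
```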
Collected summary
Dataset introduction
Construction method
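Question understanding and semantic parsing are long-standing research problems in natural language processing. Break was built as a high-quality resource for them: it contains 83,978 human-annotated natural language questions sampled from 10 question answering datasets over text, images, and databases. Its core contribution is the Question Decomposition Meaning Representation (QDMR). Crowdsourced annotators break each complex question into a sequence of atomic sub-steps, and each step is recorded together with the semantic operator it uses (such as select, filter, or project). This fine-grained annotation process yields a clearly structured, logically ordered step sequence for every question, which gives models direct supervision on a question's internal semantic structure.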
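To make the step structure concrete, the sketch below extracts which earlier steps each step refers to through its `#N` placeholders. It is an illustration under stated assumptions, not the dataset's official tooling: it assumes steps are `;`-separated, numbered from 1, and that a step only refers to steps before it. The function name `step_dependencies` is hypothetical.

```python
import re

# A reference looks like '#1', '#2', ... inside a step.
REFERENCE = re.compile(r"#(\d+)")


def step_dependencies(decomposition):
    """Map each 1-based step index to the sorted step indices it references."""
    steps = [s.strip() for s in decomposition.split(";") if s.strip()]
    dependencies = {}
    for index, step in enumerate(steps, start=1):
        refs = sorted({int(number) for number in REFERENCE.findall(step)})
        # Assumption: references must point to an earlier step.
        invalid = [r for r in refs if r < 1 or r >= index]
        if invalid:
            raise ValueError(f"step {index} has invalid references: {invalid}")
        dependencies[index] = refs
    return dependencies


# Decomposition taken from the QDMR-high-level example in the dataset card.
deps = step_dependencies(
    "return ground transportation ;return #1 which is available ;"
    "return #2 from the pittsburgh airport ;return #3 to downtown ;"
    "return the cost of #4"
)
print(deps)
```

For this example every step depends only on the one directly before it, so the decomposition forms a simple chain.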
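Features
Break's most distinctive features are its multi-level representations and its broad coverage. It provides QDMR decompositions at more than one granularity: the standard QDMR configuration, the high-level QDMR configuration, and a matching lexicon configuration for each that constrains the allowed vocabulary, so it can serve semantic parsing tasks of different granularity. It also includes a logical-forms configuration, which pairs each decomposition with a `program` field and so offers a bridge toward end-to-end, executable semantic parsing. The dataset is moderate in size: the QDMR configuration has 44,321 training, 7,760 validation, and 8,069 test examples. Because the samples span multiple domains and modalities, models trained on them can learn decomposition skills that generalize better.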
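Usage
When training and evaluating models on Break, choose the configuration that matches the task. For basic question decomposition, load the QDMR configuration and train a sequence-to-sequence model with `question_text` as the input and `decomposition` as the target sequence. To add lexical constraints, combine it with the corresponding lexicon configuration. The high-level and logical-forms configurations suit more complex semantic parsing scenarios. The dataset is already divided into training, validation, and test splits. A given configuration and split can be loaded with the Hugging Face Datasets library, and model output can be scored with standard text generation metrics (such as BLEU and ROUGE).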
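The workflow described above might look like the following sketch. It assumes the `datasets` package is installed and that the repository id `allenai/break_data` with the configuration names from the card is reachable. The helper names `load_break`, `to_seq2seq_pair`, and `normalized_exact_match` are hypothetical, and the exact-match check is a simple stand-in for illustration, not a metric taken from the card.

```python
def load_break(config="QDMR", split="validation"):
    """Load one Break configuration and split (requires network access)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("allenai/break_data", config, split=split)


def to_seq2seq_pair(record):
    """Build an input/target pair with whitespace collapsed."""
    return {
        "source": " ".join(record["question_text"].split()),
        "target": " ".join(record["decomposition"].split()),
    }


def _normalize(decomposition):
    # Compare step by step, ignoring case and spacing around ';'.
    return [" ".join(step.split()).lower()
            for step in decomposition.split(";") if step.strip()]


def normalized_exact_match(prediction, gold):
    """True if both decompositions contain the same normalized steps."""
    return _normalize(prediction) == _normalize(gold)


# Record copied from the QDMR example in the dataset card.
record = {
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "decomposition": "return flights ;return #1 from denver ;"
                     "return #2 to philadelphia ;return #3 if available",
}
pair = to_seq2seq_pair(record)

# A hypothetical model output that differs only in spacing.
prediction = ("return flights; return #1 from denver; "
              "return #2 to philadelphia; return #3 if available")
is_match = normalized_exact_match(prediction, pair["target"])
print(pair["source"])
print(is_match)
```

With network access, `load_break("QDMR", "validation")` would return the records that `to_seq2seq_pair` expects. The helper functions themselves run offline.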
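Background and challenges
Background overview
Understanding and reasoning over complex questions is a core challenge in natural language processing. To narrow the gap between natural language and structured queries, researchers at the Allen Institute for AI, including Tomer Wolfson, Mor Geva, and Ankit Gupta, introduced the BREAK dataset in 2020. It builds Question Decomposition Meaning Representations (QDMRs) that break a natural language question into interpretable semantic sub-steps, providing a structured intermediate representation for open-domain abstractive question answering and text generation tasks. BREAK sampled 83,978 examples from 10 question answering datasets over text, images, and databases and annotated them through crowdsourcing, producing a benchmark with several configurations: QDMR, high-level decompositions, and logical forms. Its central research question is how to get a model to decompose a complex question automatically into an ordered sequence of operations, and so improve the interpretability and generalization of cross-domain reasoning. The dataset laid groundwork for later research on semantic parsing, multi-hop reasoning, and explainable AI, and helped move question answering systems from end-to-end black boxes toward modular understanding.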
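Current challenges
The domain problem BREAK addresses is the structured understanding and decomposition of complex questions. Its challenges are: 1) natural language questions often contain implicit semantic relations, multi-step reasoning, and cross-modal information, so mapping a vague expression precisely onto a discrete, ordered operation sequence (such as select, filter, project) is the core difficulty; 2) existing models tend to lose meaning when handling long-range dependencies and nested sub-questions, so the decomposition drifts from the intended meaning. Construction posed challenges as well: 1) questions from the 10 source datasets differ widely in domain, phrasing, and granularity, so a unified annotation guideline is needed to keep decompositions consistent; 2) crowdsourced annotators interpret complex questions subjectively, so quality control and conflict resolution are needed to keep annotations accurate and reproducible as the data scales up.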
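Common scenarios
Classic use case
The classic use of the BREAK dataset in natural language processing is as a benchmark for question decomposition and semantic parsing. With 83,978 natural language questions and their Question Decomposition Meaning Representations (QDMRs), it offers a standardized training and evaluation platform for models that learn to break a complex question into an ordered series of sub-questions. Researchers typically use the QDMR configuration to train sequence-to-sequence models that generate the decomposition automatically, improving semantic understanding on multi-step reasoning tasks.
Academic problems addressed
BREAK's main academic contribution is to address the long-standing lack of interpretable intermediate steps in semantic parsing of complex questions. Traditional question answering systems often map a question directly to an answer, which makes the reasoning process hard to capture. By annotating the decomposition logic of each question explicitly, the dataset lets models learn a step-by-step reasoning pattern. This has supported research on explainable question answering, generalization in semantic parsing, and cross-task transfer learning, and provides a data foundation for systems with transparent reasoning.
Derived related work
BREAK has led to a range of follow-up work, including semantic parsers that build on its QDMR representation, such as end-to-end models that combine decomposition steps with logical form generation. Later studies explored cross-domain transfer learning with the dataset and tested how well question decomposition representations generalize to unseen tasks. Some work also combined BREAK with knowledge graph question answering, using decomposition strategies to make multi-hop reasoning more robust, which led to architectures such as decomposition-aware graph neural networks.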
The content above was collected and summarized by 遇见数据集.