---
annotations_creators:
- crowdsourced
- machine-generated
language_creators:
- crowdsourced
- machine-generated
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|wikitable_questions
- extended|wikisql
- extended|web_nlg
- extended|cleaned_e2e
task_categories:
- tabular-to-text
task_ids:
- rdf-to-text
paperswithcode_id: dart
pretty_name: DART
dataset_info:
  features:
  - name: tripleset
    sequence:
      sequence: string
  - name: subtree_was_extended
    dtype: bool
  - name: annotations
    sequence:
    - name: source
      dtype: string
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 12966443
    num_examples: 30526
  - name: validation
    num_bytes: 1458106
    num_examples: 2768
  - name: test
    num_bytes: 2657644
    num_examples: 5097
  download_size: 29939366
  dataset_size: 17082193
---
# Dataset Card for DART
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [homepage](https://github.com/Yale-LILY/dart)
- **Repository:** [github](https://github.com/Yale-LILY/dart)
- **Paper:** [paper](https://arxiv.org/abs/2007.02871)
- **Leaderboard:** [leaderboard](https://github.com/Yale-LILY/dart#leaderboard)
### Dataset Summary
DART is a large dataset for open-domain structured data record to text generation. We consider the structured data record input as a set of RDF entity-relation triples, a format widely used for knowledge representation and semantics description. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format with its open-domain nature differentiates DART from other existing table-to-text corpora.
### Supported Tasks and Leaderboards
The task associated with DART is text generation from data records expressed as RDF triplets:
- `rdf-to-text`: The dataset can be used to train a model for text generation from RDF triplets, which consists of generating a textual description of structured data. Success on this task is typically measured by achieving *high* [BLEU](https://huggingface.co/metrics/bleu), [METEOR](https://huggingface.co/metrics/meteor), [BLEURT](https://huggingface.co/metrics/bleurt), [MoverScore](https://huggingface.co/metrics/mover_score), and [BERTScore](https://huggingface.co/metrics/bert_score) values together with a *low* [TER](https://huggingface.co/metrics/ter), which is an error rate. The [BART-large model](https://huggingface.co/facebook/bart-large) (see [BART](https://huggingface.co/transformers/model_doc/bart.html)) currently achieves the following scores:
| | BLEU | METEOR | TER | MoverScore | BERTScore | BLEURT |
| ----- | ----- | ------ | ---- | ----------- | ---------- | ------ |
| BART | 37.06 | 0.36 | 0.57 | 0.44 | 0.92 | 0.22 |
This task has an active leaderboard, which can be found [here](https://github.com/Yale-LILY/dart#leaderboard) and ranks models based on the above metrics.
### Languages
The dataset is in English (en).
## Dataset Structure
### Data Instances
Here is an example from the dataset:
```
{'annotations': {'source': ['WikiTableQuestions_mturk'],
                 'text': ['First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville']},
 'subtree_was_extended': False,
 'tripleset': [['First Clearing', 'LOCATION', 'On NYS 52 1 Mi. Youngsville'],
               ['On NYS 52 1 Mi. Youngsville', 'CITY_OR_TOWN', 'Callicoon, New York']]}
```
It contains one annotation, whose textual description is 'First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville'. The RDF triplets used to generate this description are stored in `tripleset`, each formatted as (subject, predicate, object).
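A sequence-to-sequence model usually needs the triple set as a single input string. The sketch below linearizes the instance shown above; the helper name `linearize` and the `<H>`, `<R>`, `<T>` marker tokens are illustrative assumptions and are not defined by the dataset.

```python
# Hypothetical helper: flatten a DART tripleset into one input string.
# The <H>/<R>/<T> markers are an illustrative convention, not part of DART.
def linearize(tripleset):
    parts = []
    for subject, predicate, obj in tripleset:
        parts.append(f"<H> {subject} <R> {predicate} <T> {obj}")
    return " ".join(parts)


# The instance shown above, copied as a plain Python dict.
example = {
    "annotations": {
        "source": ["WikiTableQuestions_mturk"],
        "text": [
            "First Clearing\tbased on Callicoon, New York and location at "
            "On NYS 52 1 Mi. Youngsville"
        ],
    },
    "subtree_was_extended": False,
    "tripleset": [
        ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
        ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"],
    ],
}

source_text = linearize(example["tripleset"])       # model input
target_text = example["annotations"]["text"][0]     # reference description
print(source_text)
```

Any other separator scheme would work equally well, as long as the same one is used at training and inference time.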
### Data Fields
The different fields are:
- `annotations`:
  - `text`: list of text descriptions of the triplets
  - `source`: list of sources of the RDF triplets (WikiTable, e2e, etc.)
- `subtree_was_extended`: boolean indicating whether the subtree considered during dataset construction was extended. This field is sometimes missing, in which case it is set to `None`
- `tripleset`: RDF triplets as a list of triplets of strings (subject, predicate, object)
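Because `annotations` holds parallel lists, a record with several descriptions can be expanded into one training pair per description. The following is a minimal sketch under that reading of the fields; the helper names `to_pairs` and `was_extended` are hypothetical, and the `toy` record is made up for illustration and is not taken from DART.

```python
# Hypothetical helpers for working with the fields described above.
def to_pairs(example):
    """Yield one (tripleset, text, source) tuple per annotation."""
    annotations = example["annotations"]
    # `text` and `source` are parallel lists, so zip pairs them up.
    for text, source in zip(annotations["text"], annotations["source"]):
        yield example["tripleset"], text, source


def was_extended(example):
    """Treat a missing or None `subtree_was_extended` value as False."""
    return bool(example.get("subtree_was_extended"))


# Made-up record (NOT from DART) with two annotations and a None flag.
toy = {
    "tripleset": [["Alice", "WORKS_AT", "Acme"]],
    "subtree_was_extended": None,
    "annotations": {
        "source": ["toy_source_a", "toy_source_b"],
        "text": ["Alice works at Acme.", "Acme employs Alice."],
    },
}

pairs = list(to_pairs(toy))
print(len(pairs), was_extended(toy))
```

Mapping `None` to `False` is a design choice of this sketch; code that needs to distinguish "not extended" from "unknown" should keep the three states separate.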
### Data Splits
There are three splits: train, validation and test.
| | train | validation | test |
| ----- |------:|-----------:|-----:|
| N. Examples | 30526 | 2768 | 5097 |
## Dataset Creation
### Curation Rationale
Automatically generating textual descriptions from structured data inputs is crucial to improving the accessibility of knowledge bases to lay users.
### Source Data
DART draws on existing datasets that cover a variety of domains and that make it possible to build a tree ontology and form RDF triple sets as semantic representations. The datasets used are WikiTableQuestions, WikiSQL, WebNLG and Cleaned E2E.
#### Initial Data Collection and Normalization
DART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions (Pasupat and Liang, 2015) and WikiSQL (Zhong et al., 2017), (2) automatic conversion of questions in WikiSQL to declarative sentences, and (3) incorporation of existing datasets including WebNLG 2017 (Gardent et al., 2017a,b; Shimorina and Gardent, 2018) and Cleaned E2E (Novikova et al., 2017b; Dušek et al., 2018, 2019).
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
The two-stage annotation process for constructing tripleset-sentence pairs is based on a tree-structured ontology of each table.
First, internal skilled annotators denote the parent column for each column header.
Then, a larger number of annotators provide a sentential description of an automatically chosen subset of table cells in a row.
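To make the role of the parent-column annotation concrete, here is a rough sketch of how a column tree could turn one table row into a triple set. The rule used (subject = parent cell, predicate = child column header, object = child cell) is an assumption inferred from the example instance in this card, not a description of the authors' actual pipeline; the `NAME` column, the `parent_of` mapping, and the function `row_to_triples` are hypothetical.

```python
# Stage one (assumed form): each column header is mapped to its parent column.
# None marks the column treated as the root of the tree. "NAME" is hypothetical.
parent_of = {
    "NAME": None,
    "LOCATION": "NAME",
    "CITY_OR_TOWN": "LOCATION",
}

# One table row, keyed by column header.
row = {
    "NAME": "First Clearing",
    "LOCATION": "On NYS 52 1 Mi. Youngsville",
    "CITY_OR_TOWN": "Callicoon, New York",
}


def row_to_triples(row, parent_of, selected_columns):
    """Build [subject, predicate, object] triples for the selected cells."""
    triples = []
    for column in selected_columns:
        parent = parent_of.get(column)
        if parent is None:
            continue  # the root column has no incoming edge
        # Assumed rule: parent cell -> child header -> child cell.
        triples.append([row[parent], column, row[column]])
    return triples


# Stage two would ask annotators to describe the selected cells in a sentence.
triples = row_to_triples(row, parent_of, ["LOCATION", "CITY_OR_TOWN"])
print(triples)
```

With these assumed inputs, the resulting triples have the same shape as the `tripleset` in the data instance shown earlier.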
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
DART is released under the MIT license (see [here](https://github.com/Yale-LILY/dart/blob/master/LICENSE)).
### Citation Information
```
@article{radev2020dart,
  title={DART: Open-Domain Structured Data Record to Text Generation},
  author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Nazneen Fatema Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}
```
### Contributions
Thanks to [@lhoestq](https://github.com/lhoestq) for adding this dataset.