tuetschek/e2e_nlg_cleaned
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/tuetschek/e2e_nlg_cleaned
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
paperswithcode_id: null
pretty_name: the Cleaned Version of the E2E Dataset
tags:
- meaning-representation-to-text
dataset_info:
  features:
  - name: meaning_representation
    dtype: string
  - name: human_reference
    dtype: string
  splits:
  - name: train
    num_bytes: 7474936
    num_examples: 33525
  - name: validation
    num_bytes: 1056527
    num_examples: 4299
  - name: test
    num_bytes: 1262597
    num_examples: 4693
  download_size: 14597407
  dataset_size: 9794060
---
# Dataset Card for the Cleaned Version of the E2E Dataset
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [homepage](http://www.macs.hw.ac.uk/InteractionLab/E2E/)
- **Repository:** [repository](https://github.com/tuetschek/e2e-dataset/)
- **Paper:** [paper](https://arxiv.org/abs/1706.09254)
- **Leaderboard:** [leaderboard](http://www.macs.hw.ac.uk/InteractionLab/E2E/)
### Dataset Summary
An updated release of the E2E NLG Challenge data with cleaned meaning representations (MRs) and scripts.
The E2E dataset is used for training end-to-end, data-driven natural language generation systems in the restaurant domain. It is ten times bigger than existing, frequently used datasets in this area.
The E2E dataset poses new challenges:
(1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena;
(2) generating from this set requires content selection.
As such, learning from this dataset promises more natural, varied and less template-like system utterances.
The E2E dataset was released with the following paper, which provides more details and baseline results:
https://arxiv.org/abs/1706.09254
### Supported Tasks and Leaderboards
- `text2text-generation-other-meaning-representtion-to-text`: The dataset can be used to train a model to generate restaurant descriptions from meaning representations: the model takes structured data about a restaurant as input and generates a natural-language sentence that presents the different aspects of that data. Success on this task is typically measured by achieving a *high* [BLEU](https://huggingface.co/metrics/bleu), [NIST](https://huggingface.co/metrics/nist), [METEOR](https://huggingface.co/metrics/meteor), [Rouge-L](https://huggingface.co/metrics/rouge), and [CIDEr](https://huggingface.co/metrics/cider) score.
This task has an inactive leaderboard which can be found [here](http://www.macs.hw.ac.uk/InteractionLab/E2E/) and ranks models based on the metrics above.
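As a rough illustration of the task framing, the sketch below turns one record into a (source, target) string pair of the kind a text2text model is trained on. The field names come from the dataset metadata; the `PREFIX` string and the `to_text2text_pair` helper are hypothetical choices made for this example, not something the dataset prescribes.

```python
from typing import Dict, Tuple

# Hypothetical task prefix: the dataset card does not prescribe any prompt format.
PREFIX = "generate restaurant description: "


def to_text2text_pair(record: Dict[str, str], prefix: str = PREFIX) -> Tuple[str, str]:
    """Map one dataset record to a (source, target) pair of strings."""
    source = prefix + record["meaning_representation"]  # model input
    target = record["human_reference"]                  # text the model should produce
    return source, target


# The instance shown in this card, written out as a plain dict.
record = {
    "meaning_representation": (
        "name[The Vaults], eatType[pub], priceRange[more than £30], "
        "customer rating[5 out of 5], near[Café Adriatic]"
    ),
    "human_reference": (
        "The Vaults pub near Café Adriatic has a 5 star rating. Prices start at £30."
    ),
}

source, target = to_text2text_pair(record)
print(source)
print(target)
```

Whether a prefix helps at all depends on the model being fine-tuned; passing `prefix=""` leaves the meaning representation unchanged as the input.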
### Languages
The dataset is in English (en).
## Dataset Structure
### Data Instances
Example of one instance:
```
{'human_reference': 'The Vaults pub near Café Adriatic has a 5 star rating. Prices start at £30.',
'meaning_representation': 'name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Café Adriatic]'}
```
### Data Fields
- `human_reference`: string, a natural-language text that describes the characteristics given in the meaning representation
- `meaning_representation`: string, a comma-separated list of `slot[value]` pairs to generate a description from
Each MR consists of 3–8 attributes (slots), such as name, food or area, and their values.
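The `slot[value]` layout of the MR string can be unpacked with a short regular expression. The following is a minimal sketch based only on the example instance shown in this card; `parse_mr` is a hypothetical helper, and it assumes that values never contain a closing square bracket.

```python
import re
from typing import Dict

# One "slot[value]" pair: a slot name without commas or brackets,
# followed by a bracketed value. Assumes values contain no ']'.
_SLOT_PATTERN = re.compile(r"([^,\[\]]+)\[([^\]]*)\]")


def parse_mr(mr: str) -> Dict[str, str]:
    """Parse 'slot[value], slot[value], ...' into a slot -> value dict."""
    return {slot.strip(): value for slot, value in _SLOT_PATTERN.findall(mr)}


example_mr = (
    "name[The Vaults], eatType[pub], priceRange[more than £30], "
    "customer rating[5 out of 5], near[Café Adriatic]"
)
slots = parse_mr(example_mr)

for slot, value in slots.items():
    print(f"{slot}: {value}")
```

Slot names such as `customer rating` contain spaces, which is why the pattern excludes only commas and brackets from the slot name instead of matching word characters.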
### Data Splits
The dataset is split into training, validation and testing sets (in a 76.5-8.5-15 ratio), keeping a similar distribution of MR and reference text lengths and ensuring that MRs in different sets are distinct.
| | train | validation | test |
|--------------|------:|-----------:|-----:|
| N. Instances | 33525 | 4299 | 4693 |
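The instance counts in the table can be converted to percentage shares with a few lines of arithmetic. This is only a sanity-check sketch: by instance count the shares come out at roughly 78.9 / 10.1 / 11.0, which differs from the ratio quoted above, so that ratio is presumably computed on a different basis that this card does not state.

```python
# Instance counts copied from the splits table in this card.
split_sizes = {"train": 33525, "validation": 4299, "test": 4693}

total = sum(split_sizes.values())

# Percentage of all instances that falls into each split.
shares = {name: 100.0 * count / total for name, count in split_sizes.items()}

print(f"total instances: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```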
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
The data was collected using the CrowdFlower platform and quality-controlled following Novikova et al. (2016).
#### Who are the source language producers?
[More Information Needed]
### Annotations
Following Novikova et al. (2016), the E2E data was collected using pictures as stimuli, which was shown to elicit significantly more natural, more informative, and better phrased human references than textual MRs.
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The card metadata lists the license as `cc-by-sa-4.0` (CC BY-SA 4.0).
### Citation Information
```
@article{dusek.etal2020:csl,
title = {Evaluating the {{State}}-of-the-{{Art}} of {{End}}-to-{{End Natural Language Generation}}: {{The E2E NLG Challenge}}},
author = {Du{\v{s}}ek, Ond\v{r}ej and Novikova, Jekaterina and Rieser, Verena},
year = {2020},
month = jan,
volume = {59},
pages = {123--156},
doi = {10.1016/j.csl.2019.06.009},
archivePrefix = {arXiv},
eprint = {1901.11528},
eprinttype = {arxiv},
journal = {Computer Speech \& Language}
}
```
### Contributions
Thanks to [@yjernite](https://github.com/yjernite) for adding this dataset.
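Provider:
tuetschek
Summary of original information
Dataset overview
Name: the Cleaned Version of the E2E Dataset
Language: English (en)
License: cc-by-sa-4.0
Multilinguality: monolingual
Size category: 10K<n<100K
Source datasets: original
Task category: text2text-generation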
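Dataset info:
- Features:
  - meaning_representation: string, the restaurant's attributes and their values.
  - human_reference: string, natural-language text describing the restaurant.
- Splits:
  - train: 33525 instances, 7474936 bytes
  - validation: 4299 instances, 1056527 bytes
  - test: 4693 instances, 1262597 bytes
- Download size: 14597407 bytes
- Dataset size: 9794060 bytes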
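Dataset description
Dataset summary: This dataset is an updated release of the E2E NLG Challenge data, with cleaned MRs and scripts. It is mainly used to train end-to-end, data-driven natural language generation systems in the restaurant domain, and it is ten times bigger than existing, frequently used datasets.
Supported tasks and leaderboards:
text2text-generation-other-meaning-representtion-to-text: Used to train a model to generate natural-language restaurant descriptions from meaning representations. Success is typically measured with metrics such as BLEU, NIST, METEOR, Rouge-L, and CIDEr.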
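Dataset structure
Data instances:
Each example has two fields, human_reference and meaning_representation, holding the natural-language description and the meaning representation respectively.
Data fields:
human_reference: natural-language text describing the restaurant.
meaning_representation: 3-8 attributes (slots) and their values, used to generate the description.
Data splits: The dataset is split into training, validation, and test sets, keeping a similar distribution of MR and reference text lengths and ensuring that MRs in different sets are distinct.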



