
tuetschek/e2e_nlg_cleaned

Hugging Face | Updated 2024-01-18 | Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/tuetschek/e2e_nlg_cleaned
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
paperswithcode_id: null
pretty_name: the Cleaned Version of the E2E Dataset
tags:
- meaning-representation-to-text
dataset_info:
  features:
  - name: meaning_representation
    dtype: string
  - name: human_reference
    dtype: string
  splits:
  - name: train
    num_bytes: 7474936
    num_examples: 33525
  - name: validation
    num_bytes: 1056527
    num_examples: 4299
  - name: test
    num_bytes: 1262597
    num_examples: 4693
  download_size: 14597407
  dataset_size: 9794060
---

# Dataset Card for the Cleaned Version of the E2E Dataset

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [homepage](http://www.macs.hw.ac.uk/InteractionLab/E2E/)
- **Repository:** [repository](https://github.com/tuetschek/e2e-dataset/)
- **Paper:** [paper](https://arxiv.org/abs/1706.09254)
- **Leaderboard:** [leaderboard](http://www.macs.hw.ac.uk/InteractionLab/E2E/)

### Dataset Summary

An update release of E2E NLG Challenge data with cleaned MRs and scripts, accompanying the following paper:

The E2E dataset is used for training end-to-end, data-driven natural language generation systems in the restaurant domain, and it is ten times bigger than existing, frequently used datasets in this area. The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection. As such, learning from this dataset promises more natural, varied and less template-like system utterances.

E2E is released in the following paper, where you can find more details and baseline results: https://arxiv.org/abs/1706.09254

### Supported Tasks and Leaderboards

- `text2text-generation-other-meaning-representtion-to-text`: The dataset can be used to train a model to generate descriptions in the restaurant domain from meaning representations. The model takes some data about a restaurant as input and generates a natural-language sentence that presents the different aspects of that data. Success on this task is typically measured by achieving a *high* [BLEU](https://huggingface.co/metrics/bleu), [NIST](https://huggingface.co/metrics/nist), [METEOR](https://huggingface.co/metrics/meteor), [Rouge-L](https://huggingface.co/metrics/rouge), or [CIDEr](https://huggingface.co/metrics/cider) score. This task has an inactive leaderboard, which can be found [here](http://www.macs.hw.ac.uk/InteractionLab/E2E/) and ranks models based on the metrics above.

### Languages

The dataset is in English (en).

## Dataset Structure

### Data Instances

Example of one instance:

```
{'human_reference': 'The Vaults pub near Café Adriatic has a 5 star rating. Prices start at £30.',
 'meaning_representation': 'name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Café Adriatic]'}
```

### Data Fields

- `human_reference`: string; natural-language text that describes the characteristics given in the meaning representation
- `meaning_representation`: string; a list of slots and values to generate a description from

Each MR consists of 3–8 attributes (slots), such as name, food or area, and their values.

### Data Splits

The dataset is split into training, validation and testing sets (in a 76.5-8.5-15 ratio), keeping a similar distribution of MR and reference text lengths and ensuring that MRs in different sets are distinct.

|              | train | validation | test |
|--------------|------:|-----------:|-----:|
| N. Instances | 33525 |       4299 | 4693 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

The data was collected using the CrowdFlower platform and quality-controlled following Novikova et al. (2016).

#### Who are the source language producers?

[More Information Needed]

### Annotations

Following Novikova et al. (2016), the E2E data was collected using pictures as stimuli, which was shown to elicit significantly more natural, more informative, and better phrased human references than textual MRs.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@article{dusek.etal2020:csl,
  title = {Evaluating the {{State}}-of-the-{{Art}} of {{End}}-to-{{End Natural Language Generation}}: {{The E2E NLG Challenge}}},
  author = {Du{\v{s}}ek, Ond\v{r}ej and Novikova, Jekaterina and Rieser, Verena},
  year = {2020},
  month = jan,
  volume = {59},
  pages = {123--156},
  doi = {10.1016/j.csl.2019.06.009},
  archivePrefix = {arXiv},
  eprint = {1901.11528},
  eprinttype = {arxiv},
  journal = {Computer Speech \& Language}
}
```

### Contributions

Thanks to [@yjernite](https://github.com/yjernite) for adding this dataset.
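Provider:
tuetschek
Summary of original information

Dataset overview

Name: the Cleaned Version of the E2E Dataset

Language: English (en)

License: cc-by-sa-4.0

Multilinguality: monolingual

Size category: 10K<n<100K

Source datasets: original

Task category: text2text-generation

Dataset info:

  • Features:

    • meaning_representation: string; the restaurant's attributes and their values.
    • human_reference: string; natural-language text describing the restaurant.
  • Splits:

    • train: 33525 instances, 7474936 bytes
    • validation: 4299 instances, 1056527 bytes
    • test: 4693 instances, 1262597 bytes
  • Download size: 14597407 bytes

  • Dataset size: 9794060 bytes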
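
The instance counts listed above can be turned into per-split shares, which is handy when comparing against the ratio quoted in the dataset card. The following is a minimal sketch; the names `SPLIT_SIZES` and `split_shares` are illustrative and not part of the dataset or any library.

```python
# Instance counts copied from the split listing above.
SPLIT_SIZES = {"train": 33525, "validation": 4299, "test": 4693}


def split_shares(sizes):
    """Return each split's share of all instances, as a percentage."""
    total = sum(sizes.values())
    return {name: 100.0 * count / total for name, count in sizes.items()}


if __name__ == "__main__":
    for name, share in split_shares(SPLIT_SIZES).items():
        print(f"{name}: {share:.1f}%")
```

The shares are derived only from the listed counts, so they describe this cleaned release rather than any other version of the E2E data.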

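Dataset description

Dataset summary: This dataset is an updated release of the E2E NLG Challenge data, with cleaned MRs and scripts. It is mainly used to train end-to-end, data-driven natural language generation systems in the restaurant domain, and it is ten times larger than existing, frequently used datasets in this area.

Supported tasks and leaderboards:

  • text2text-generation-other-meaning-representtion-to-text: used to train a model that generates natural-language restaurant descriptions from meaning representations. Success is typically measured with metrics such as BLEU, NIST, METEOR, Rouge-L, and CIDEr.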
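
Metrics such as BLEU are often computed against several references per input. As a rough sketch of how evaluation data might be prepared, the function below groups references by meaning representation. It assumes each instance is a dict with the two fields named in this card and that one MR may appear with more than one reference; `group_references` and the toy rows are hypothetical.

```python
from collections import defaultdict


def group_references(instances):
    """Collect all human references that share one meaning representation."""
    grouped = defaultdict(list)
    for row in instances:
        grouped[row["meaning_representation"]].append(row["human_reference"])
    return dict(grouped)


# Hypothetical toy rows, made up for illustration only.
toy_rows = [
    {"meaning_representation": "name[The Vaults], eatType[pub]",
     "human_reference": "The Vaults is a pub."},
    {"meaning_representation": "name[The Vaults], eatType[pub]",
     "human_reference": "There is a pub called The Vaults."},
    {"meaning_representation": "name[Blue Spice], eatType[coffee shop]",
     "human_reference": "Blue Spice is a coffee shop."},
]

references_by_mr = group_references(toy_rows)
```

Each key of `references_by_mr` is one MR string and each value is the list of references to score a generated sentence against.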

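Dataset structure

Data instances: Each example has two fields, human_reference and meaning_representation, holding the natural-language description and the meaning representation respectively.

Data fields:

  • human_reference: natural-language text describing the restaurant.
  • meaning_representation: 3-8 attributes (slots) and their values, from which the description is generated.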
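
Since the meaning representation is stored as a single string of `slot[value]` pairs, a small parser can make the slots easier to work with. This is a minimal sketch, assuming slots are comma-separated and values contain no `]`; `parse_mr` is an illustrative helper, not part of the dataset tooling.

```python
import re

# One slot looks like `slot name[value]`; slots are separated by commas.
_SLOT_PATTERN = re.compile(r"\s*([^,\[\]]+?)\s*\[([^\]]*)\]")


def parse_mr(mr):
    """Parse an MR string such as 'name[The Vaults], eatType[pub]' into a dict."""
    return {slot.strip(): value.strip() for slot, value in _SLOT_PATTERN.findall(mr)}


# The example MR from the dataset card.
example_mr = (
    "name[The Vaults], eatType[pub], priceRange[more than £30], "
    "customer rating[5 out of 5], near[Café Adriatic]"
)
slots = parse_mr(example_mr)
```

For the example instance this yields five slots (`name`, `eatType`, `priceRange`, `customer rating`, `near`), which falls within the stated range of 3-8 attributes.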

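Data splits: The dataset is divided into training, validation, and test sets, keeping a similar distribution of MR and reference-text lengths and ensuring that the MRs in different sets are distinct.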
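
The claim that MRs do not repeat across splits can be checked with simple set intersections. The sketch below assumes each split is available as a list of MR strings; `shared_mrs` and the toy data are hypothetical, and the toy data deliberately contains an overlap to show what a violation would look like.

```python
from itertools import combinations


def shared_mrs(splits):
    """Map each pair of split names to the MR strings they have in common."""
    unique = {name: set(mrs) for name, mrs in splits.items()}
    overlaps = {}
    for first, second in combinations(unique, 2):
        common = unique[first] & unique[second]
        if common:
            overlaps[(first, second)] = common
    return overlaps


# Hypothetical toy data; the real splits are expected to yield no overlaps.
toy_splits = {
    "train": ["name[The Vaults], eatType[pub]", "name[The Vaults], eatType[pub]"],
    "validation": ["name[Blue Spice], eatType[coffee shop]"],
    "test": ["name[The Vaults], eatType[pub]"],
}

overlaps = shared_mrs(toy_splits)
```

An empty result means every MR is confined to a single split; repeated MRs within one split are allowed and are not reported.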
