Yale-LILY/dart | Text generation dataset | Data transformation dataset

Hugging Face, updated 2022-11-18, indexed 2024-05-25

Text generation

Data transformation

Download link:

https://hf-mirror.com/datasets/Yale-LILY/dart

Resource overview:

---
annotations_creators:
- crowdsourced
- machine-generated
language_creators:
- crowdsourced
- machine-generated
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|wikitable_questions
- extended|wikisql
- extended|web_nlg
- extended|cleaned_e2e
task_categories:
- tabular-to-text
task_ids:
- rdf-to-text
paperswithcode_id: dart
pretty_name: DART
dataset_info:
  features:
  - name: tripleset
    sequence:
      sequence: string
  - name: subtree_was_extended
    dtype: bool
  - name: annotations
    sequence:
    - name: source
      dtype: string
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 12966443
    num_examples: 30526
  - name: validation
    num_bytes: 1458106
    num_examples: 2768
  - name: test
    num_bytes: 2657644
    num_examples: 5097
  download_size: 29939366
  dataset_size: 17082193
---

# Dataset Card for DART

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [homepage](https://github.com/Yale-LILY/dart)
- **Repository:** [github](https://github.com/Yale-LILY/dart)
- **Paper:** [paper](https://arxiv.org/abs/2007.02871)
- **Leaderboard:** [leaderboard](https://github.com/Yale-LILY/dart#leaderboard)

### Dataset Summary

DART is a large dataset for open-domain structured data record to text generation. We consider the structured data record input as a set of RDF entity-relation triples, a format widely used for knowledge representation and semantics description. DART consists of 82,191 examples across different domains, with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format and its open-domain nature differentiate DART from other existing table-to-text corpora.

### Supported Tasks and Leaderboards

The task associated with DART is text generation from data records that are RDF triplets:

- `rdf-to-text`: The dataset can be used to train a model for text generation from RDF triplets, which consists of generating a textual description of structured data. Success on this task is typically measured by achieving a *high* [BLEU](https://huggingface.co/metrics/bleu), [METEOR](https://huggingface.co/metrics/meteor), [BLEURT](https://huggingface.co/metrics/bleurt), [MoverScore](https://huggingface.co/metrics/mover_score), and [BERTScore](https://huggingface.co/metrics/bert_score), and a *low* [TER](https://huggingface.co/metrics/ter). The [BART-large model](https://huggingface.co/facebook/bart-large) from [BART](https://huggingface.co/transformers/model_doc/bart.html) currently achieves the following scores:

| | BLEU | METEOR | TER | MoverScore | BERTScore | BLEURT |
| ----- | ----- | ------ | ---- | ----------- | ---------- | ------ |
| BART | 37.06 | 0.36 | 0.57 | 0.44 | 0.92 | 0.22 |

This task has an active leaderboard, which can be found [here](https://github.com/Yale-LILY/dart#leaderboard) and ranks models based on the above metrics.

### Languages

The dataset is in English (en).

## Dataset Structure

### Data Instances

Here is an example from the dataset:

```
{'annotations': {'source': ['WikiTableQuestions_mturk'],
                 'text': ['First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville']},
 'subtree_was_extended': False,
 'tripleset': [['First Clearing', 'LOCATION', 'On NYS 52 1 Mi. Youngsville'],
               ['On NYS 52 1 Mi. Youngsville', 'CITY_OR_TOWN', 'Callicoon, New York']]}
```

It contains one annotation where the textual description is 'First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville'. The RDF triplets used to generate this description are in `tripleset` and are formatted as subject, predicate, object.

### Data Fields

The different fields are:

- `annotations`:
  - `text`: list of text descriptions of the triplets
  - `source`: list of sources of the RDF triplets (WikiTable, e2e, etc.)
- `subtree_was_extended`: boolean, whether the subtree considered during the dataset construction was extended. Sometimes this field is missing, in which case it is set to `None`
- `tripleset`: RDF triplets as a list of triplets of strings (subject, predicate, object)

### Data Splits

There are three splits: train, validation and test.

| | train | validation | test |
| ----- |------:|-----------:|-----:|
| N. Examples | 30526 | 2768 | 5097 |

## Dataset Creation

### Curation Rationale

Automatically generating textual descriptions from structured data inputs is crucial to improving the accessibility of knowledge bases to lay users.

### Source Data

DART comes from existing datasets that cover a variety of different domains while allowing a tree ontology to be built and RDF triple sets to be formed as semantic representations. The datasets used are WikiTableQuestions, WikiSQL, WebNLG and Cleaned E2E.

#### Initial Data Collection and Normalization

DART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions (Pasupat and Liang, 2015) and WikiSQL (Zhong et al., 2017), (2) automatic conversion of questions in WikiSQL to declarative sentences, and (3) incorporation of existing datasets including WebNLG 2017 (Gardent et al., 2017a,b; Shimorina and Gardent, 2018) and Cleaned E2E (Novikova et al., 2017b; Dušek et al., 2018, 2019).

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

The two-stage annotation process for constructing tripleset-sentence pairs is based on a tree-structured ontology of each table. First, internal skilled annotators denote the parent column for each column header. Then, a larger number of annotators provide a sentential description of an automatically chosen subset of table cells in a row.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Under MIT license (see [here](https://github.com/Yale-LILY/dart/blob/master/LICENSE))

### Citation Information

```
@article{radev2020dart,
  title={DART: Open-Domain Structured Data Record to Text Generation},
  author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Nazneen Fatema Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}
```

### Contributions

Thanks to [@lhoestq](https://github.com/lhoestq) for adding this dataset.
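For the `rdf-to-text` task, a sequence-to-sequence model needs the tripleset as a single input string. The card does not prescribe how to linearize triples, so the following is only a minimal sketch under assumptions: the marker tokens `<H>`, `<R>`, `<T>` and the helper name `linearize_tripleset` are hypothetical choices for illustration, applied to the data instance shown in the card.

```python
from typing import List

# Hypothetical marker tokens (an assumption): DART does not define a
# linearization format, so any unambiguous separators would do.
HEAD, REL, TAIL = "<H>", "<R>", "<T>"


def linearize_tripleset(tripleset: List[List[str]]) -> str:
    """Flatten [subject, predicate, object] triples into one input string."""
    parts = []
    for subject, predicate, obj in tripleset:
        parts.append(f"{HEAD} {subject} {REL} {predicate} {TAIL} {obj}")
    return " ".join(parts)


# The data instance quoted in the dataset card.
example = {
    "annotations": {
        "source": ["WikiTableQuestions_mturk"],
        "text": [
            "First Clearing\tbased on Callicoon, New York and location at "
            "On NYS 52 1 Mi. Youngsville"
        ],
    },
    "subtree_was_extended": False,
    "tripleset": [
        ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
        ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"],
    ],
}

model_input = linearize_tripleset(example["tripleset"])   # source side
reference = example["annotations"]["text"][0]             # target side
print(model_input)
```

The linearized string would serve as the model input and the annotation text as the reference when computing metrics such as BLEU.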

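Provided by:

Yale-LILY

Summary of original information

Dataset overview

Name: DART

Language: English (en)

License: MIT

Multilinguality: monolingual

Size category: 10K<n<100K

Source datasets:

Extended from WikiTableQuestions
Extended from WikiSQL
Extended from WebNLG
Extended from Cleaned E2E

Task category: tabular-to-text

Task ID: rdf-to-text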

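Dataset information:

Features:
- tripleset: a sequence of string sequences; each inner sequence is one RDF triple (subject, predicate, object).
- subtree_was_extended: boolean, whether the subtree was extended.
- annotations: a sequence containing:
  - source: string, the source of the data.
  - text: string, the text description.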
Data splits:
- train: 30526 examples, 12966443 bytes.
- validation: 2768 examples, 1458106 bytes.
- test: 5097 examples, 2657644 bytes.
Download size: 29939366 bytes
Dataset size: 17082193 bytes
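As a quick sanity check on these figures, the short sketch below sums the per-split numbers and compares them with the reported dataset size. The variable names are arbitrary; the numbers are copied from the listing above.

```python
# Split statistics copied from the listing above.
splits = {
    "train": {"num_examples": 30526, "num_bytes": 12966443},
    "validation": {"num_examples": 2768, "num_bytes": 1458106},
    "test": {"num_examples": 5097, "num_bytes": 2657644},
}
dataset_size = 17082193  # reported dataset size in bytes

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# The per-split byte counts add up to the reported dataset size.
assert total_bytes == dataset_size

# Print each split's share of the total number of examples.
for name, s in splits.items():
    share = 100 * s["num_examples"] / total_examples
    print(f"{name:<10} {s['num_examples']:>6} examples ({share:.1f}%)")

print(f"total      {total_examples:>6} examples")
```

The total of 38391 examples falls inside the declared 10K<n<100K size category.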

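Dataset creation

Source data:

WikiTableQuestions
WikiSQL
WebNLG
Cleaned E2E

Annotation process:

Each table is annotated using a tree-structured ontology.
First, skilled internal annotators mark the parent column of each column header.
Then, a larger group of annotators writes sentence descriptions of automatically selected cells in a table row.

License information:

MIT License; see https://github.com/Yale-LILY/dart/blob/master/LICENSE.

Citation information: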

@article{radev2020dart, title={DART: Open-Domain Structured Data Record to Text Generation}, author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Nazneen Fatema Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher}, journal={arXiv preprint arXiv:2007.02871}, year={2020} }
