totto
Source: ModelScope community (updated 2025-12-05, indexed 2025-07-12)
Download link: https://modelscope.cn/datasets/google-research-datasets/totto
# Dataset Card for ToTTo
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** None
- **Repository:** https://github.com/google-research-datasets/ToTTo
- **Paper:** https://arxiv.org/abs/2004.14373
- **Leaderboard:** https://github.com/google-research-datasets/ToTTo#leaderboard
- **Point of Contact:** [totto@google.com](mailto:totto@google.com)
### Dataset Summary
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples. It proposes a controlled
generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English, per the dataset summary above.
## Dataset Structure
### Data Instances
A sample from the training set is shown below:
```
{'example_id': '1762238357686640028',
'highlighted_cells': [[13, 2]],
'id': 0,
'overlap_subset': 'none',
'sentence_annotations': {'final_sentence': ['A Favorita is the telenovela aired in the 9 pm timeslot.'],
'original_sentence': ['It is also the first telenovela by the writer to air in the 9 pm timeslot.'],
'sentence_after_ambiguity': ['A Favorita is the telenovela aired in the 9 pm timeslot.'],
'sentence_after_deletion': ['It is the telenovela air in the 9 pm timeslot.']},
'table': [[{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': '#'},
{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Run'},
{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Title'},
{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Chapters'},
{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Author'},
{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Director'},
{'column_span': 1,
'is_header': True,
'row_span': 1,
'value': 'Ibope Rating'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '59'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'June 5, 2000— February 2, 2001'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Laços de Família'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Manoel Carlos'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Ricardo Waddington'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '44.9'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '60'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'February 5, 2001— September 28, 2001'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Porto dos Milagres'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Aguinaldo Silva Ricardo Linhares'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Marcos Paulo Simões'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '44.6'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '61'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'October 1, 2001— June 14, 2002'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'O Clone'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '221'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Glória Perez'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Jayme Monjardim'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '47.0'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '62'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'June 17, 2002— February 14, 2003'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Esperança'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Benedito Ruy Barbosa'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Luiz Fernando'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '37.7'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '63'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'February 17, 2003— October 10, 2003'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Mulheres Apaixonadas'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Manoel Carlos'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Ricardo Waddington'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '46.6'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '64'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'October 13, 2003— June 25, 2004'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Celebridade'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '221'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Gilberto Braga'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Dennis Carvalho'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '46.0'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '65'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'June 28, 2004— March 11, 2005'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Senhora do Destino'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '221'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Aguinaldo Silva'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Wolf Maya'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '50.4'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '66'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'March 14, 2005— November 4, 2005'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'América'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Glória Perez'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Jayme Monjardim Marcos Schechtman'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '49.4'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '67'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'November 7, 2005— July 7, 2006'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Belíssima'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Sílvio de Abreu'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Denise Saraceni'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '48.5'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '68'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'July 10, 2006— March 2, 2007'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Páginas da Vida'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Manoel Carlos'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Jayme Monjardim'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '46.8'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '69'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'March 5, 2007— September 28, 2007'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Paraíso Tropical'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '179'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Gilberto Braga Ricardo Linhares'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Dennis Carvalho'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '42.8'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '70'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'October 1, 2007— May 31, 2008'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Duas Caras'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '210'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Aguinaldo Silva'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Wolf Maya'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '41.1'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '71'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'June 2, 2008— January 16, 2009'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'A Favorita'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '197'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'João Emanuel Carneiro'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Ricardo Waddington'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '39.5'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '72'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'January 19, 2009— September 11, 2009'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Caminho das Índias'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Glória Perez'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Marcos Schechtman'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '38.8'}],
[{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '73'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'September 14, 2009— May 14, 2010'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Viver a Vida'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Manoel Carlos'},
{'column_span': 1,
'is_header': False,
'row_span': 1,
'value': 'Jayme Monjardim'},
{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '35.6'}]],
'table_page_title': 'List of 8/9 PM telenovelas of Rede Globo',
'table_section_text': '',
'table_section_title': '2000s',
'table_webpage_url': 'http://en.wikipedia.org/wiki/List_of_8/9_PM_telenovelas_of_Rede_Globo'}
```
Note that sentence annotations are not available for the test set, so the values inside `sentence_annotations` can be safely ignored there.
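Because the test split carries no usable references, code that reads targets may want to branch on the split. The sketch below is a minimal illustration, assuming examples are plain Python dicts shaped like the sample above; `target_sentences` and its `split` argument are hypothetical names, not part of any ToTTo tooling.

```python
from typing import Any, Dict, List


def target_sentences(example: Dict[str, Any], split: str) -> List[str]:
    """Return the reference sentences for an example, or [] for the test split."""
    if split == "test":
        # Test-set annotations are not available, so ignore whatever is stored.
        return []
    annotations = example.get("sentence_annotations") or {}
    # Drop empty strings so callers only see real references.
    return [s for s in annotations.get("final_sentence", []) if s]


# Toy example reusing the annotation from the sample above.
example = {
    "sentence_annotations": {
        "final_sentence": ["A Favorita is the telenovela aired in the 9 pm timeslot."],
    }
}
print(target_sentences(example, "train"))
print(target_sentences(example, "test"))
```

Returning an empty list rather than raising keeps evaluation loops uniform across splits; whether that is the right behavior depends on the surrounding pipeline.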
### Data Fields
- `table_webpage_url` (`str`): URL of the web page that contains the table.
- `table_page_title` (`str`): Title of the page that contains the table (table context metadata).
- `table_section_title` (`str`): Title of the section that contains the table (table context metadata).
- `table_section_text` (`str`): Text of the section that contains the table (table context metadata).
- `table` (`List[List[Dict]]`): The outer list represents rows and each inner list holds the cells (columns) of one row. Each Dict has the fields:
  - `column_span` (`int`): Number of columns the cell spans.
  - `is_header` (`bool`): Whether the cell is a header cell.
  - `row_span` (`int`): Number of rows the cell spans.
  - `value` (`str`): Text content of the cell.
- `highlighted_cells` (`List[[row_index, column_index]]`): Each `[row_index, column_index]` pair indicates that `table[row_index][column_index]` is highlighted.
- `example_id` (`str`): A unique id for this example.
- `sentence_annotations`: Consists of the `original_sentence` and the successive revisions (`sentence_after_deletion`, `sentence_after_ambiguity`) that lead to the `final_sentence`.
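To make the indexing convention concrete, here is a minimal sketch that resolves `highlighted_cells` against `table` and pairs each highlighted value with the header in the same column. It assumes that row 0 is the only header row and that every span is 1, as in the sample above; tables with merged cells would need extra handling. The names `highlighted_cells_with_headers` and `make_cell` are hypothetical helpers introduced for illustration.

```python
from typing import Any, Dict, List, Tuple


def highlighted_cells_with_headers(example: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pair each highlighted cell value with the header in the same column.

    Assumption: row 0 is the only header row and every span is 1, so a
    column index means the same thing in every row.
    """
    table = example["table"]
    header_row = table[0]
    pairs = []
    for row_index, column_index in example["highlighted_cells"]:
        value = table[row_index][column_index]["value"]
        header = ""
        if column_index < len(header_row) and header_row[column_index]["is_header"]:
            header = header_row[column_index]["value"]
        pairs.append((header, value))
    return pairs


def make_cell(value: str, is_header: bool = False) -> Dict[str, Any]:
    """Build a cell dict with the four documented fields (hypothetical helper)."""
    return {"column_span": 1, "is_header": is_header, "row_span": 1, "value": value}


# Toy two-column table, cut down from the sample instance.
toy_example = {
    "table": [
        [make_cell("Title", True), make_cell("Chapters", True)],
        [make_cell("A Favorita"), make_cell("197")],
    ],
    "highlighted_cells": [[1, 0]],
}
print(highlighted_cells_with_headers(toy_example))  # [('Title', 'A Favorita')]
```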
### Data Splits
```
DatasetDict({
train: Dataset({
features: ['id', 'table_page_title', 'table_webpage_url', 'table_section_title', 'table_section_text', 'table', 'highlighted_cells', 'example_id', 'sentence_annotations', 'overlap_subset'],
num_rows: 120761
})
validation: Dataset({
features: ['id', 'table_page_title', 'table_webpage_url', 'table_section_title', 'table_section_text', 'table', 'highlighted_cells', 'example_id', 'sentence_annotations', 'overlap_subset'],
num_rows: 7700
})
test: Dataset({
features: ['id', 'table_page_title', 'table_webpage_url', 'table_section_title', 'table_section_text', 'table', 'highlighted_cells', 'example_id', 'sentence_annotations', 'overlap_subset'],
num_rows: 7700
})
})
```
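For a sense of proportion, the split sizes listed above can be totalled directly. The numbers are copied from the listing; the variable names are illustrative only.

```python
# Row counts copied from the DatasetDict listing above.
split_sizes = {"train": 120761, "validation": 7700, "test": 7700}

total = sum(split_sizes.values())  # 136161 examples overall

for name, size in split_sizes.items():
    # Print each split's share of the full dataset as a percentage.
    print(f"{name}: {size} examples ({size / total:.1%} of {total})")
```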
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@inproceedings{parikh2020totto,
title={{ToTTo}: A Controlled Table-To-Text Generation Dataset},
author={Parikh, Ankur P and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
booktitle={Proceedings of EMNLP},
year={2020}
}
```
### Contributions
Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.
Provider: maas
Created: 2025-07-07



