GEM/TaTA
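Dataset Overview
Dataset Description
Dataset Summary
TaTA is a multilingual table-to-text generation dataset focused on African languages. It contains about 8,700 examples in nine languages, including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and one zero-shot test language (Russian). The dataset was built by transcribing the charts and accompanying text in bilingual reports and then having them professionally translated, so that all languages are fully parallel.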
Languages and Multilinguality
- Languages: Arabic, English, French, Hausa, Igbo, Portuguese, Russian, Swahili, Yorùbá
- Multilingual: yes
License
- License: cc-by-sa-4.0 (Creative Commons Attribution Share Alike 4.0 International)
Dataset Structure
- Features:
  - gem_id: string
  - example_id: string
  - title: string
  - unit_of_measure: string
  - chart_type: string
  - was_translated: string
  - table_data: string
  - linearized_input: string
  - table_text: sequence of strings
  - target: string
Data Splits
- Splits:
  - ru: 308435 bytes, 210 examples
  - test: 1691383 bytes, 763 examples
  - train: 10019272 bytes, 6962 examples
  - validation: 1598442 bytes, 754 examples
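As a quick sanity check on the split listing, the short sketch below sums the per-split figures and prints each split's share of the examples. The numbers are copied from the listing above; the variable names are illustrative only.

```python
# Split sizes copied from the listing above: (bytes, number of examples).
SPLITS = {
    "ru": (308435, 210),
    "test": (1691383, 763),
    "train": (10019272, 6962),
    "validation": (1598442, 754),
}

total_examples = sum(n for _, n in SPLITS.values())
total_bytes = sum(b for b, _ in SPLITS.values())

for name, (num_bytes, num_examples) in SPLITS.items():
    share = 100 * num_examples / total_examples
    print(f"{name:<10} {num_examples:>5} examples ({share:4.1f}%)")

print(f"total      {total_examples:>5} examples, {total_bytes} bytes")
```

The per-split byte counts add up to the dataset size reported on the card (13617532 bytes), and the example counts add up to 8,689, consistent with the rounded figure of 8,700 in the summary.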
Download and Dataset Size
- Download size: 18543506 bytes
- Dataset size: 13617532 bytes
Dataset Details
Dataset Creation
- Creators: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
- Organization: Google Research
- Funding: Google Research
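Dataset Structure
- Data fields:
  - example_id: the example ID
  - title: the title of the table
  - unit_of_measure: a description of what the numeric values in the data represent
  - chart_type: the type of chart
  - was_translated: whether the table was translated
  - table_data: the table contents, as a JSON-encoded string
  - table_text: the sentences describing the table, JSON-encoded (a list of strings)
  - linearized_input: the table contents as a linearized string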
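Because `table_data` is stored as a JSON-encoded string rather than a nested list, it has to be decoded before use. The sketch below is one possible way to do that with the standard library. The record is a shortened version of the card's example instance, and the helper name `decode_table` is hypothetical, not part of any TaTA tooling.

```python
import json

# Shortened record in the shape described above (values taken from the
# card's example instance; only two columns and two rows are kept).
record = {
    "title": "Trends in early childhood mortality rates",
    "table_data": '[["", "Child mortality", "Under-5 mortality"], '
                  '["1990 JPFHS", 5, 39], ["2017-18 JPFHS", 3, 19]]',
}


def decode_table(table_data):
    """Decode the JSON string into (column headers, {row label: values})."""
    rows = json.loads(table_data)
    header, body = rows[0], rows[1:]
    # The first cell of the header row is empty; the first cell of each
    # body row is that row's label.
    return header[1:], {row[0]: row[1:] for row in body}


columns, values_by_row = decode_table(record["table_data"])
print(columns)
print(values_by_row["2017-18 JPFHS"])
```

This assumes the first row holds column headers and the first column holds row labels, as in the example instance; other tables in the dataset may be laid out differently.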
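Data Splitting Criteria
- Splitting criteria: the same table is always placed in the same split across all languages. Every example in the development and test splits has at least 3 references.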
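The first criterion can be checked mechanically. The sketch below groups example IDs by table and reports any table that appears in more than one split. It rests on an assumption that the card does not confirm: that IDs follow the pattern `<report>-<language>-<index>`, as the example ID `FR346-en-39` suggests. The function names and all IDs other than `FR346-en-39` are made up for illustration.

```python
from collections import defaultdict


def table_key(example_id):
    # ASSUMPTION: ids look like "<report>-<language>-<index>", e.g.
    # "FR346-en-39". Dropping the language leaves a per-table key.
    report, _language, index = example_id.rsplit("-", 2)
    return report, index


def find_split_leaks(ids_by_split):
    """Return {table key: splits} for tables that occur in several splits."""
    splits_by_table = defaultdict(set)
    for split, example_ids in ids_by_split.items():
        for example_id in example_ids:
            splits_by_table[table_key(example_id)].add(split)
    return {k: v for k, v in splits_by_table.items() if len(v) > 1}


# Hypothetical toy data: one clean assignment, one with a leak.
clean = {"train": ["FR346-en-39", "FR346-sw-39"], "test": ["FR100-en-1"]}
leaky = {"train": ["FR346-en-39"], "test": ["FR346-sw-39"]}

print(find_split_leaks(clean))
print(find_split_leaks(leaky))
```

An empty result means every table stays within a single split. If the real ID scheme differs, only `table_key` needs to change.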
Example Instance

    {
      "example_id": "FR346-en-39",
      "title": "Trends in early childhood mortality rates",
      "unit_of_measure": "Deaths per 1,000 live births for the 5-year period before the survey",
      "chart_type": "Line chart",
      "was_translated": "False",
      "table_data": "[[\"\", \"Child mortality\", \"Neonatal mortality\", \"Infant mortality\", \"Under-5 mortality\"], [\"1990 JPFHS\", 5, 21, 34, 39], [\"1997 JPFHS\", 6, 19, 29, 34], [\"2002 JPFHS\", 5, 16, 22, 27], [\"2007 JPFHS\", 2, 14, 19, 21], [\"2009 JPFHS\", 5, 15, 23, 28], [\"2012 JPFHS\", 4, 14, 17, 21], [\"2017-18 JPFHS\", 3, 11, 17, 19]]",
      "table_text": [
        "neonatal, infant, child, and under-5 mortality rates for the 5 years preceding each of seven JPFHS surveys (1990 to 2017-18).",
        "Under-5 mortality declined by half over the period, from 39 to 19 deaths per 1,000 live births.",
        "The decline in mortality was much greater between the 1990 and 2007 surveys than in the most recent period.",
        "Between 2012 and 2017-18, under-5 mortality decreased only modestly, from 21 to 19 deaths per 1,000 live births, and infant mortality remained stable at 17 deaths per 1,000 births."
      ],
      "linearized_input": "Trends in early childhood mortality rates | Deaths per 1,000 live births for the 5-year period before the survey | (Child mortality, 1990 JPFHS, 5) (Neonatal mortality, 1990 JPFHS, 21) (Infant mortality, 1990 JPFHS, 34) (Under-5 mortality, 1990 JPFHS, 39) (Child mortality, 1997 JPFHS, 6) (Neonatal mortality, 1997 JPFHS, 19) (Infant mortality, 1997 JPFHS, 29) (Under-5 mortality, 1997 JPFHS, 34) (Child mortality, 2002 JPFHS, 5) (Neonatal mortality, 2002 JPFHS, 16) (Infant mortality, 2002 JPFHS, 22) (Under-5 mortality, 2002 JPFHS, 27) (Child mortality, 2007 JPFHS, 2) (Neonatal mortality, 2007 JPFHS, 14) (Infant mortality, 2007 JPFHS, 19) (Under-5 mortality, 2007 JPFHS, 21) (Child mortality, 2009 JPFHS, 5) (Neonatal mortality, 2009 JPFHS, 15) (Infant mortality, 2009 JPFHS, 23) (Under-5 mortality, 2009 JPFHS, 28) (Child mortality, 2012 JPFHS, 4) (Neonatal mortality, 2012 JPFHS, 14) (Infant mortality, 2012 JPFHS, 17) (Under-5 mortality, 2012 JPFHS, 21) (Child mortality, 2017-18 JPFHS, 3) (Neonatal mortality, 2017-18 JPFHS, 11) (Infant mortality, 2017-18 JPFHS, 17) (Under-5 mortality, 2017-18 JPFHS, 19)"
    }
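The `linearized_input` in this instance appears to follow a simple pattern: the title, the unit of measure, and then one `(column, row, value)` triple per cell, in row order. The sketch below reproduces that pattern. It is inferred from this single example and is not confirmed to be the authors' linearization code; the function name `linearize` is hypothetical.

```python
import json


def linearize(title, unit_of_measure, table_data):
    """Build 'title | unit | (column, row, value) ...' from a JSON table string."""
    rows = json.loads(table_data)
    header, body = rows[0], rows[1:]
    cells = [
        f"({column}, {row[0]}, {value})"
        for row in body
        for column, value in zip(header[1:], row[1:])
    ]
    return f"{title} | {unit_of_measure} | " + " ".join(cells)


# Two columns and two rows taken from the example instance.
table_data = (
    '[["", "Child mortality", "Under-5 mortality"], '
    '["1990 JPFHS", 5, 39], ["2017-18 JPFHS", 3, 19]]'
)
linearized = linearize(
    "Trends in early childhood mortality rates",
    "Deaths per 1,000 live births for the 5-year period before the survey",
    table_data,
)
print(linearized)
```

Applied to the full `table_data` of the instance, the same function yields the triples in the order shown in `linearized_input`. Tables with missing cells or multi-level headers would need extra handling that this sketch does not attempt.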
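Role of the Dataset in GEM
- Role in GEM: TaTA is described as the only multilingual, parallel data-to-text dataset. More than 70% of its references require reasoning, so it is both high quality and challenging for models.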
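Dataset Creation Details
- Language data acquisition: the language data comes from the Demographic and Health Surveys (DHS) Program of the United States Agency for International Development (USAID) (https://dhsprogram.com/).
- Topics covered: fertility, family planning, maternal and child health, gender, and nutrition.
- Data validation: validated by crowdworkers.
- Data filtering: no filtering was applied.
Structured Annotations
- Additional annotations: expert created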
- **评



