TATA

arXiv · Updated 2022-11-01 · Indexed 2024-06-21
Download link:
https://github.com/google-research/url-nlp
Description:
TATA is the first large-scale multilingual table-to-text dataset focused on African languages. Created by Google Research, it contains 8,700 examples across nine languages, including four African languages (Hausa, Igbo, Swahili, and Yoruba) and one zero-shot test language (Russian). The dataset was built by transcribing charts and text from bilingual reports published by the Demographic and Health Surveys (DHS) Program and then translating them professionally. Screenshots of the original charts are released with the dataset to support future research on multilingual multimodal methods. An in-depth human evaluation shows that TATA is challenging for current models and that existing automatic evaluation metrics perform poorly on it. To address this, the authors introduce a new learned evaluation metric that correlates highly with human judgment.
Provider:
Google Research
Created:
2022-11-01