tatoeba_mt
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-09-20.
Download link:
https://modelscope.cn/datasets/Helsinki-NLP/tatoeba_mt
Resource description:
# Dataset Card for tatoeba_mt
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/Helsinki-NLP/Tatoeba-Challenge/
- **Repository:** https://github.com/Helsinki-NLP/Tatoeba-Challenge/
- **Paper:** [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/)
- **Leaderboard:**
- **Point of Contact:** [Jörg Tiedemann](mailto:jorg.tiedemann@helsinki.fi)
### Dataset Summary
The Tatoeba Translation Challenge is a multilingual data set of machine translation benchmarks derived from user-contributed translations collected by [Tatoeba.org](https://tatoeba.org/) and provided as a parallel corpus from [OPUS](https://opus.nlpl.eu/). This dataset includes test and development data sorted by language pair. It includes test sets for hundreds of language pairs and is continuously updated. Please check the version number tag to identify the release that you are using.
### Supported Tasks and Leaderboards
The translation task is described in detail in the [Tatoeba-Challenge repository](https://github.com/Helsinki-NLP/Tatoeba-Challenge) and covers various sub-tasks with different data coverage and resources. [Training data](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/README.md) is available from the same repository, and [results](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/results/tatoeba-results-all.md) are collected and published there as well. [Models](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/results/tatoeba-models-all.md) are released for public use, and some of them are also available from the [Hugging Face model hub](https://huggingface.co/Helsinki-NLP).
### Languages
The data set covers hundreds of languages and language pairs and is organized by ISO-639-3 language codes. The current release covers the following languages: Afrikaans, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan, Chamorro, Czech, Chuvash, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, Faroese, French, Western Frisian, Irish, Scottish Gaelic, Galician, Guarani, Hebrew, Hindi, Croatian, Hungarian, Armenian, Interlingua, Indonesian, Interlingue, Ido, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Khmer, Korean, Kurdish, Cornish, Latin, Luxembourgish, Lithuanian, Latvian, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Norwegian Bokmål, Dutch, Norwegian Nynorsk, Norwegian, Occitan, Polish, Portuguese, Quechua, Rundi, Romanian, Russian, Serbo-Croatian, Slovenian, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Thai, Turkmen, Tagalog, Turkish, Tatar, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Yiddish, Chinese
## Dataset Structure
### Data Instances
Data instances are given as translation units in TAB-separated files with four columns: source and target language ISO-639-3 codes, source language text and target language text. Note that we do not imply a translation direction and consider the data set to be symmetric and to be used as a test set in both directions. Language-pair-specific subsets are only provided under the label of one direction using sorted ISO-639-3 language IDs.
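The direction-free labelling and symmetric use described above can be sketched in a few lines of Python. This is only an illustration: the helper names `pair_label` and `reverse_unit` are hypothetical, and the hyphen used to join the two codes is an assumption rather than something this card specifies.

```python
def pair_label(lang_a: str, lang_b: str) -> str:
    """Return one label per language pair by sorting the two ISO-639-3 codes.

    The "-" separator is an assumption made for this sketch.
    """
    first, second = sorted([lang_a, lang_b])
    return f"{first}-{second}"


def reverse_unit(unit):
    """Swap source and target so the same unit can test the opposite direction.

    `unit` is a 4-tuple: (src_lang, tgt_lang, src_text, tgt_text).
    """
    src_lang, tgt_lang, src_text, tgt_text = unit
    return (tgt_lang, src_lang, tgt_text, src_text)


# Both directions map to the same subset label.
print(pair_label("hbs", "eng"), pair_label("eng", "hbs"))
```

Because the label is built from sorted codes, a lookup for either translation direction resolves to the same subset.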
Some subsets contain several sub-languages or language variants. They may refer to macro-languages such as Serbo-Croatian, which is covered by the ISO code `hbs`. Language variants may also include different writing systems, and in that case the ISO 15924 script codes are attached to the language codes. Here are a few examples from the English to Serbo-Croatian test set, including examples for Bosnian, Croatian and Serbian in Cyrillic and in Latin scripts:
```
eng bos_Latn Children are the flowers of our lives. Djeca su cvijeće našeg života.
eng hrv A bird was flying high up in the sky. Ptica je visoko letjela nebom.
eng srp_Cyrl A bird in the hand is worth two in the bush. Боље врабац у руци, него голуб на грани.
eng srp_Latn Canada is the motherland of ice hockey. Kanada je zemlja-majka hokeja na ledu.
```
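As a rough sketch of how such rows could be read, the following Python splits each line on TAB characters and separates an optional script code from the language code. The names `TranslationUnit`, `split_label` and `read_units` are hypothetical, and the assumption that the script code follows a single underscore is taken only from the examples above.

```python
import io
from typing import NamedTuple, Optional


class TranslationUnit(NamedTuple):
    src_lang: str
    src_script: Optional[str]
    tgt_lang: str
    tgt_script: Optional[str]
    src_text: str
    tgt_text: str


def split_label(label: str):
    """Split 'srp_Cyrl' into ('srp', 'Cyrl'); a plain 'hrv' gives ('hrv', None)."""
    lang, _, script = label.partition("_")
    return lang, (script or None)


def read_units(handle):
    """Read four-column TAB-separated rows from an open text handle."""
    units = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows instead of guessing their layout
        src_label, tgt_label, src_text, tgt_text = fields
        src_lang, src_script = split_label(src_label)
        tgt_lang, tgt_script = split_label(tgt_label)
        units.append(
            TranslationUnit(src_lang, src_script, tgt_lang, tgt_script, src_text, tgt_text)
        )
    return units


# Two rows from the examples above, with explicit TAB separators.
sample = (
    "eng\tbos_Latn\tChildren are the flowers of our lives.\tDjeca su cvijeće našeg života.\n"
    "eng\tsrp_Cyrl\tA bird in the hand is worth two in the bush.\tБоље врабац у руци, него голуб на грани.\n"
)
units = read_units(io.StringIO(sample))
for unit in units:
    print(unit.tgt_lang, unit.tgt_script, unit.tgt_text)
```

Keeping the script code as a separate field makes it straightforward to filter a macro-language subset by writing system.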
There are also data sets with sentence pairs in the same language. In most cases, those are variants with minor spelling differences but they also include rephrased sentences. Here are a few examples from the English test set:
```
eng eng All of us got into the car. We all got in the car.
eng eng All of us hope that doesn't happen. All of us hope that that doesn't happen.
eng eng All the seats are booked. The seats are all sold out.
```
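To get a rough feel for how close the two sides of a same-language pair are, one could compare them character by character with Python's standard `difflib` module. This is a heuristic sketch and not part of the benchmark: the function names are hypothetical and the 0.9 threshold is an arbitrary value chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def looks_like_minor_variant(a: str, b: str, threshold: float = 0.9) -> bool:
    """Heuristic only: the default threshold is an arbitrary illustrative choice."""
    return similarity(a, b) >= threshold


# Two of the English pairs shown above.
pairs = [
    ("All of us hope that doesn't happen.", "All of us hope that that doesn't happen."),
    ("All the seats are booked.", "The seats are all sold out."),
]
for src, tgt in pairs:
    print(f"{similarity(src, tgt):.2f}", looks_like_minor_variant(src, tgt))
```

A score like this only separates near-identical strings from rephrasings; it says nothing about whether a rephrasing preserves meaning.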
### Data Splits
Test and development data sets are disjoint with respect to sentence pairs but may include overlaps in individual source or target language sentences. Development data should not be used in training directly. The goal of the data splits is to create test sets of reasonable size with a large language coverage. Test sets include at most 10,000 instances. Development data do not exist for all language pairs.
To be comparable with other results, models should use the training data distributed from the [Tatoeba MT Challenge Repository](https://github.com/Helsinki-NLP/Tatoeba-Challenge/) including monolingual data sets also listed there.
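The difference between pair-level disjointness and sentence-level overlap in this section can be checked with plain set operations. The sketch below assumes each split has already been loaded as a list of `(source_text, target_text)` tuples; the function names and the toy sentences are made up for illustration and are not taken from the data set.

```python
def pair_overlap(dev, test):
    """Sentence pairs present in both splits; expected to be empty."""
    return set(dev) & set(test)


def sentence_overlap(dev, test):
    """Individual sentences shared by the splits; such overlap is permitted."""
    dev_sentences = {sentence for pair in dev for sentence in pair}
    test_sentences = {sentence for pair in test for sentence in pair}
    return dev_sentences & test_sentences


# Toy data invented for this example.
dev = [("Hello.", "Hallo."), ("Thank you.", "Danke.")]
test = [("Hello.", "Guten Tag."), ("Good night.", "Gute Nacht.")]

print(pair_overlap(dev, test))      # no pair occurs in both splits
print(sentence_overlap(dev, test))  # "Hello." appears on one side of both splits
print(len(test) <= 10_000)          # test sets include at most 10,000 instances
```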
## Dataset Creation
### Curation Rationale
The Tatoeba MT data set will be updated continuously, and the data preparation procedures are public and released on [GitHub](https://github.com/Helsinki-NLP/Tatoeba-Challenge/). High language coverage is the main goal of the project, and data sets are prepared to be consistent and systematic, with standardized language labels and distribution formats.
### Source Data
#### Initial Data Collection and Normalization
The Tatoeba data sets are collected from user-contributed translations submitted to [Tatoeba.org](https://tatoeba.org/) and compiled into a multi-parallel corpus in [OPUS](https://opus.nlpl.eu/Tatoeba.php). The test and development sets are incrementally updated with new releases of the Tatoeba data collection at OPUS. New releases extend the existing data sets. Test sets should not overlap with any of the released development data sets.
#### Who are the source language producers?
The data sets come from [Tatoeba.org](https://tatoeba.org/), which provides a large database of sentences and their translations into a wide variety of languages. Its content is constantly growing as a result of voluntary contributions of thousands of users.
The original project was founded by Trang Ho in 2006, hosted on Sourceforge under the codename of multilangdict.
### Annotations
#### Annotation process
Sentences are translated by volunteers, and the Tatoeba database also provides additional metadata about each record, such as user ratings. However, the metadata is currently not used in any way for the compilation of the MT benchmark. Language skills of contributors naturally vary quite a bit, and not all translations are done by native speakers of the target language. More information about the contributions can be found at [Tatoeba.org](https://tatoeba.org/).
#### Who are the annotators?
### Personal and Sensitive Information
For information about handling personal and sensitive information we refer to the [original provider](https://tatoeba.org/) of the data. This data set has not been processed in any way to detect or remove potentially sensitive or personal information.
## Considerations for Using the Data
### Social Impact of Dataset
The language coverage is high, which makes the data set a highly valuable resource for machine translation development, especially for lesser-resourced languages and language pairs. The constantly growing database also represents a dynamic resource, and its value will grow further.
### Discussion of Biases
The original source depends on its contributors, and their interests and backgrounds will lead to certain subjective and cultural biases. Language coverage and translation quality are also biased by the skills of the contributors.
### Other Known Limitations
The sentences are typically quite short and, therefore, rather easy to translate. For high-resource languages, this leads to results that will be less useful than those from more challenging benchmarks. For lesser-resourced language pairs, the limited complexity of the examples is actually an advantage, because it makes it possible to measure progress even in very challenging setups.
## Additional Information
### Dataset Curators
The data set is curated by the University of Helsinki and its [language technology research group](https://blogs.helsinki.fi/language-technology/). Data and tools used for creating and using the resource are [open source](https://github.com/Helsinki-NLP/Tatoeba-Challenge/) and will be maintained as part of the [OPUS ecosystem](https://opus.nlpl.eu/) for parallel data and machine translation research.
### Licensing Information
The data sets are distributed under the same licence agreement as the original Tatoeba database, using a [CC-BY 2.0 license](https://creativecommons.org/licenses/by/2.0/fr/). More information about the terms of use of the original data sets is listed [here](https://tatoeba.org/eng/terms_of_use).
### Citation Information
If you use the data sets, please cite the following paper: [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/)
```
@inproceedings{tiedemann-2020-tatoeba,
title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.wmt-1.139",
pages = "1174--1182",
}
```
### Contributions
Thanks to [@jorgtied](https://github.com/jorgtied) and [@Helsinki-NLP](https://github.com/Helsinki-NLP) for adding this dataset.
Thanks also to [CSC Finland](https://www.csc.fi/en/solutions-for-research) for providing computational resources and storage space for the work on OPUS and other MT projects.
Provider:
maas
Created:
2025-08-16



