wmt/wmt17
Source: Hugging Face (updated 2024-04-03, indexed 2024-03-04)
Download link: https://hf-mirror.com/datasets/wmt/wmt17
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- cs
- de
- en
- fi
- lv
- ru
- tr
- zh
license:
- unknown
multilinguality:
- translation
size_categories:
- 10M<n<100M
source_datasets:
- extended|europarl_bilingual
- extended|news_commentary
- extended|setimes
- extended|un_multi
task_categories:
- translation
task_ids: []
pretty_name: WMT17
dataset_info:
- config_name: cs-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - cs
        - en
  splits:
  - name: train
    num_bytes: 300697615
    num_examples: 1018291
  - name: validation
    num_bytes: 707862
    num_examples: 2999
  - name: test
    num_bytes: 674422
    num_examples: 3005
  download_size: 181690407
  dataset_size: 302079899
- config_name: de-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - en
  splits:
  - name: train
    num_bytes: 1715532715
    num_examples: 5906184
  - name: validation
    num_bytes: 735508
    num_examples: 2999
  - name: test
    num_bytes: 729511
    num_examples: 3004
  download_size: 1011327465
  dataset_size: 1716997734
- config_name: fi-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fi
        - en
  splits:
  - name: train
    num_bytes: 743854397
    num_examples: 2656542
  - name: validation
    num_bytes: 1410507
    num_examples: 6000
  - name: test
    num_bytes: 1388820
    num_examples: 6004
  download_size: 423069132
  dataset_size: 746653724
- config_name: lv-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - lv
        - en
  splits:
  - name: train
    num_bytes: 517416244
    num_examples: 3567528
  - name: validation
    num_bytes: 544596
    num_examples: 2003
  - name: test
    num_bytes: 530466
    num_examples: 2001
  download_size: 245201883
  dataset_size: 518491306
- config_name: ru-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ru
        - en
  splits:
  - name: train
    num_bytes: 11000055690
    num_examples: 24782720
  - name: validation
    num_bytes: 1050669
    num_examples: 2998
  - name: test
    num_bytes: 1040187
    num_examples: 3001
  download_size: 4866529051
  dataset_size: 11002146546
- config_name: tr-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - tr
        - en
  splits:
  - name: train
    num_bytes: 60416449
    num_examples: 205756
  - name: validation
    num_bytes: 732428
    num_examples: 3000
  - name: test
    num_bytes: 752765
    num_examples: 3007
  download_size: 37706176
  dataset_size: 61901642
- config_name: zh-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - zh
        - en
  splits:
  - name: train
    num_bytes: 6336104073
    num_examples: 25134743
  - name: validation
    num_bytes: 589583
    num_examples: 2002
  - name: test
    num_bytes: 540339
    num_examples: 2001
  download_size: 3576239952
  dataset_size: 6337233995
configs:
- config_name: cs-en
  data_files:
  - split: train
    path: cs-en/train-*
  - split: validation
    path: cs-en/validation-*
  - split: test
    path: cs-en/test-*
- config_name: de-en
  data_files:
  - split: train
    path: de-en/train-*
  - split: validation
    path: de-en/validation-*
  - split: test
    path: de-en/test-*
- config_name: fi-en
  data_files:
  - split: train
    path: fi-en/train-*
  - split: validation
    path: fi-en/validation-*
  - split: test
    path: fi-en/test-*
- config_name: lv-en
  data_files:
  - split: train
    path: lv-en/train-*
  - split: validation
    path: lv-en/validation-*
  - split: test
    path: lv-en/test-*
- config_name: ru-en
  data_files:
  - split: train
    path: ru-en/train-*
  - split: validation
    path: ru-en/validation-*
  - split: test
    path: ru-en/test-*
- config_name: tr-en
  data_files:
  - split: train
    path: tr-en/train-*
  - split: validation
    path: tr-en/validation-*
  - split: test
    path: tr-en/test-*
- config_name: zh-en
  data_files:
  - split: train
    path: zh-en/train-*
  - split: validation
    path: zh-en/validation-*
  - split: test
    path: zh-en/test-*
---
# Dataset Card for "wmt17"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [http://www.statmt.org/wmt17/translation-task.html](http://www.statmt.org/wmt17/translation-task.html)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.78 GB
- **Size of the generated dataset:** 302.09 MB
- **Total amount of disk used:** 2.09 GB
### Dataset Summary
<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
<p><b>Warning:</b> There are issues with the Common Crawl corpus data (<a href="https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz">training-parallel-commoncrawl.tgz</a>):</p>
<ul>
<li>Non-English files contain many English sentences.</li>
<li>Their "parallel" sentences in English are not aligned: they are uncorrelated with their counterpart.</li>
</ul>
<p>We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. Their rationale pertains to the expectation that such data has been superseded, primarily by CCMatrix, and to some extent, by ParaCrawl datasets.</p>
</div>
Translation dataset based on the data from statmt.org.
Versions exist for different years using a combination of data
sources. The base `wmt` allows you to create a custom dataset by choosing
your own data/language pair. This can be done as follows:
```python
import datasets  # needed for datasets.Split below
from datasets import inspect_dataset, load_dataset_builder

# Copy the dataset loading scripts to a local directory.
inspect_dataset("wmt17", "path/to/scripts")

# Build a custom dataset from a chosen language pair and chosen subsets.
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version
ds = builder.as_streaming_dataset()
```
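Because of the Common Crawl alignment problems described in the warning at the top of this section, it may be worth screening sentence pairs before training. The sketch below is a crude heuristic, not a method prescribed by the WMT organizers or the dataset authors: the helper names `looks_misaligned` and `filter_pairs` and the `max_ratio` threshold are assumptions made for illustration. It flags pairs whose two sides are identical (a hint that the non-English side is really English) or whose lengths differ implausibly.

```python
def looks_misaligned(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Heuristically flag a sentence pair as suspicious.

    Hypothetical helper: the rules and the default threshold are
    illustrative assumptions, not part of any WMT17 tooling.
    """
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return True  # one side is empty
    if src.lower() == tgt.lower():
        return True  # same text on both sides, likely untranslated English
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    return longer / shorter > max_ratio  # implausible length mismatch


def filter_pairs(pairs, max_ratio: float = 3.0):
    """Keep only the (src, tgt) pairs that pass the heuristic."""
    return [(s, t) for s, t in pairs if not looks_misaligned(s, t, max_ratio)]
```

A length ratio cannot detect pairs that are both fluent and similar in length but unrelated in meaning, so this only removes the most obvious noise.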
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### cs-en
- **Size of downloaded dataset files:** 1.78 GB
- **Size of the generated dataset:** 302.09 MB
- **Total amount of disk used:** 2.09 GB
An example of 'train' looks as follows.
```
```
### Data Fields
The data fields are the same among all splits.
#### cs-en
- `translation`: a multilingual `string` variable, with possible languages including `cs`, `en`.
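As a rough illustration of the `translation` field, each record can be treated as a dictionary keyed by language code. The record below is an invented placeholder with the same shape as a cs-en example, not an actual row from the dataset, and `to_pair` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Placeholder record with the same shape as a cs-en example.
# The sentences are invented for illustration, not taken from WMT17.
record: Dict[str, Dict[str, str]] = {
    "translation": {
        "cs": "Dobrý den.",
        "en": "Good day.",
    }
}


def to_pair(example: Dict[str, Dict[str, str]], src: str, tgt: str) -> Tuple[str, str]:
    """Return the (source, target) sentences for the requested language codes."""
    translation = example["translation"]
    return translation[src], translation[tgt]


source, target = to_pair(record, src="cs", tgt="en")
print(source, "->", target)
```

Swapping `src` and `tgt` gives the reverse translation direction from the same record.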
### Data Splits
|name | train |validation|test|
|-----|------:|---------:|---:|
|cs-en|1018291| 2999|3005|
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@InProceedings{bojar-EtAl:2017:WMT1,
author = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huang, Shujian and Huck, Matthias and Koehn, Philipp and Liu, Qun and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Rubino, Raphael and Specia, Lucia and Turchi, Marco},
title = {Findings of the 2017 Conference on Machine Translation (WMT17)},
booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
month = {September},
year = {2017},
address = {Copenhagen, Denmark},
publisher = {Association for Computational Linguistics},
pages = {169--214},
url = {http://www.aclweb.org/anthology/W17-4717}
}
```
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
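Provider:
wmt
Summary of original information
Dataset overview
Basic information
- Name: WMT17
- Languages: cs, de, en, fi, lv, ru, tr, zh
- License: unknown
- Multilinguality: translation
- Size: 10M<n<100M
Dataset structure
- Task category: translation
- Configurations: one per language pair (cs-en, de-en, fi-en, lv-en, ru-en, tr-en, zh-en)
- Features: each configuration has a single feature named `translation`, which holds one string per language of the pair
- Data splits: each configuration has train, validation, and test splits, detailed below:
| Config | Train bytes | Train examples | Validation bytes | Validation examples | Test bytes | Test examples |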
|---|---|---|---|---|---|---|
| cs-en | 300697615 | 1018291 | 707862 | 2999 | 674422 | 3005 |
| de-en | 1715532715 | 5906184 | 735508 | 2999 | 729511 | 3004 |
| fi-en | 743854397 | 2656542 | 1410507 | 6000 | 1388820 | 6004 |
| lv-en | 517416244 | 3567528 | 544596 | 2003 | 530466 | 2001 |
| ru-en | 11000055690 | 24782720 | 1050669 | 2998 | 1040187 | 3001 |
| tr-en | 60416449 | 205756 | 732428 | 3000 | 752765 | 3007 |
| zh-en | 6336104073 | 25134743 | 589583 | 2002 | 540339 | 2001 |
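The per-config counts in the table can be checked against the `10M<n<100M` size category with a few lines of arithmetic. This is only a sketch: the dictionary name `TRAIN_EXAMPLES` is an assumption, and the numbers are copied from the train-examples column above.

```python
# Train-split example counts per config, copied from the table above.
TRAIN_EXAMPLES = {
    "cs-en": 1_018_291,
    "de-en": 5_906_184,
    "fi-en": 2_656_542,
    "lv-en": 3_567_528,
    "ru-en": 24_782_720,
    "tr-en": 205_756,
    "zh-en": 25_134_743,
}

total = sum(TRAIN_EXAMPLES.values())
largest = max(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)
smallest = min(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)

print(f"total train examples: {total:,}")
# List configs from largest to smallest with their share of the total.
for name, count in sorted(TRAIN_EXAMPLES.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {count:>12,} ({count / total:.1%})")
```

The total lands inside the declared size category, and the two largest configs (zh-en and ru-en) account for most of the training examples.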
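Dataset creation
- Source data: extended from several datasets, including europarl_bilingual, news_commentary, setimes, and un_multi
- Annotations: none
- Personal and sensitive information: not stated
Considerations for use
- Social impact: not stated
- Discussion of biases: not stated
- Other known limitations: not stated
Additional information
- Dataset curators: not stated
- Licensing information: unknown
- Citation information: a BibTeX entry is provided for academic citation
- Contributors: thanks to @patrickvonplaten and @thomwolf
Collected summary
Dataset introduction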

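Construction
The WMT17 dataset is built from data provided by statmt.org and covers translation tasks for several language pairs. It combines multiple sources, including Europarl, News Commentary, SETimes, and UN Multi, into a large multilingual translation corpus. Each language-pair configuration has train, validation, and test splits.
Characteristics
WMT17 covers eight languages: Czech, German, English, Finnish, Latvian, Russian, Turkish, and Chinese, with every configuration pairing one of them with English. The dataset is also large: the train splits range from about 206 thousand examples (tr-en) to about 25 million (zh-en), which gives machine translation models ample training material.
Usage
Users can load a specific language-pair configuration through the Hugging Face `datasets` library and then prepare it in Python for model training and evaluation. Each record holds a single `translation` field, so the data can be fed to machine translation tasks directly.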
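A minimal sketch of loading one language pair follows. The repository id `wmt/wmt17` and the config names come from the metadata above; the wrapper `load_pair` and the `CONFIGS` tuple are hypothetical conveniences, and the call needs the `datasets` package plus network access when it actually runs.

```python
CONFIGS = ("cs-en", "de-en", "fi-en", "lv-en", "ru-en", "tr-en", "zh-en")


def load_pair(pair: str, split: str = "validation", streaming: bool = False):
    """Load one WMT17 language pair from the Hugging Face Hub.

    Hypothetical convenience wrapper around `datasets.load_dataset`.
    """
    if pair not in CONFIGS:
        raise ValueError(f"unknown config {pair!r}; expected one of {CONFIGS}")
    # Imported lazily so that config validation works without the package.
    from datasets import load_dataset

    return load_dataset("wmt/wmt17", pair, split=split, streaming=streaming)
```

For instance, `load_pair("cs-en")` would fetch the small validation split, while `load_pair("ru-en", split="train", streaming=True)` would iterate over the large train split without downloading it in full first.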
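Background and challenges
Background overview
WMT17 is the large multilingual translation dataset released in 2017 for the Second Conference on Machine Translation (WMT). The authors of the accompanying findings paper include Ondřej Bojar, Rajen Chatterjee, Christian Federmann, and others. By combining sources such as Europarl, News Commentary, SETimes, and UN Multi, the dataset provides a translation corpus spanning several language pairs. Its central research question is how to improve machine translation systems, in particular translation accuracy and fluency across languages. The release gave researchers a standardized benchmark and has supported further work on translation models.
Current challenges
Building WMT17 involved several challenges. First, the variety of sources leads to uneven data quality: the Common Crawl data in particular contains many misaligned sentences, which complicates cleaning and preprocessing. Second, multilingual translation is hard in itself, because differences in grammar and cultural context between languages make models harder to train. Third, the dataset is large and spans many language pairs, so processing and storing it efficiently takes care. Finally, updating and maintaining the dataset is an ongoing challenge as new data sources keep appearing.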
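Common scenarios
Typical use cases
In machine translation, WMT17 is a standard resource thanks to its parallel text in several languages. It is widely used to train and evaluate machine translation models, and it also appears in cross-lingual information retrieval and multilingual text processing work. Its parallel corpora give researchers and developers material for improving the accuracy and efficiency of translation systems.
Derived and related work
Many later studies build on WMT17. Researchers have used it to develop new translation models and improve translation quality, and it has prompted work on methods for processing and analyzing multilingual data.
Recent research on the dataset
Latest research directions
Recent work involving WMT17 concentrates on optimizing and extending multilingual translation models. Multilingual Transformer-based models such as mBERT and XLM-R share parameters across languages, and pretrained models of this kind have been explored as a way to improve multilingual translation. Combining data augmentation with pretrained models is also reported to help on low-resource language pairs.
The content above was collected and summarized by 遇见数据集.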



