iwslt2017
Source: ModelScope community (updated 2025-12-05, indexed 2025-11-22)
Download link: https://modelscope.cn/datasets/IWSLT/iwslt2017
# Dataset Card for IWSLT 2017
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://sites.google.com/site/iwsltevaluation2017/TED-tasks](https://sites.google.com/site/iwsltevaluation2017/TED-tasks)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Overview of the IWSLT 2017 Evaluation Campaign](https://aclanthology.org/2017.iwslt-1.1/)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 4.24 GB
- **Size of the generated dataset:** 1.14 GB
- **Total amount of disk used:** 5.38 GB
### Dataset Summary
The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system
across all directions including English, German, Dutch, Italian and Romanian. As an unofficial task, conventional
bilingual text translation is offered between English and Arabic, French, Japanese, Chinese, German and Korean.
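To make "all directions" concrete, the following minimal Python sketch enumerates every ordered source/target pair among the five multilingual-task languages. The ISO 639-1 codes `nl`, `it` and `ro` and the helper `direction_names` are introduced here for illustration. The `iwslt2017-{src}-{tgt}` naming follows the configuration names shown later in this card, but whether a configuration exists under every generated name is an assumption, not something this card confirms.

```python
from itertools import permutations
from typing import List

# Languages of the multilingual task as listed in the summary above.
# The two-letter codes are ISO 639-1; only "en" and "de" appear elsewhere in this card.
MULTILINGUAL = ["en", "de", "nl", "it", "ro"]


def direction_names(languages: List[str], prefix: str = "iwslt2017") -> List[str]:
    """Build one name per ordered (source, target) pair of distinct languages.

    The name pattern mirrors configurations such as "iwslt2017-de-en";
    it is a hypothetical helper, not part of any library.
    """
    return [f"{prefix}-{src}-{tgt}" for src, tgt in permutations(languages, 2)]


names = direction_names(MULTILINGUAL)
print(len(names))
for name in names:
    print(name)
```

With five languages there are 5 x 4 ordered pairs, so a single multilingual system has to cover each of them, including pairs it may never have seen together during training (the zero-shot case).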
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### iwslt2017-ar-en
- **Size of downloaded dataset files:** 27.75 MB
- **Size of the generated dataset:** 58.74 MB
- **Total amount of disk used:** 86.49 MB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"translation": "{\"ar\": \"لقد طرت في \\\"القوات الجوية \\\" لمدة ثمان سنوات. والآن أجد نفسي مضطرا لخلع حذائي قبل صعود الطائرة!\", \"en\": \"I flew on Air ..."
}
```
#### iwslt2017-de-en
- **Size of downloaded dataset files:** 16.76 MB
- **Size of the generated dataset:** 44.43 MB
- **Total amount of disk used:** 61.18 MB
An example of 'train' looks as follows.
```
{
"translation": {
"de": "Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür.",
"en": "And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful."
}
}
```
#### iwslt2017-en-ar
- **Size of downloaded dataset files:** 29.33 MB
- **Size of the generated dataset:** 58.74 MB
- **Total amount of disk used:** 88.07 MB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"translation": "{\"ar\": \"لقد طرت في \\\"القوات الجوية \\\" لمدة ثمان سنوات. والآن أجد نفسي مضطرا لخلع حذائي قبل صعود الطائرة!\", \"en\": \"I flew on Air ..."
}
```
#### iwslt2017-en-de
- **Size of downloaded dataset files:** 16.76 MB
- **Size of the generated dataset:** 44.43 MB
- **Total amount of disk used:** 61.18 MB
An example of 'validation' looks as follows.
```
{
"translation": {
"de": "Die nächste Folie, die ich Ihnen zeige, ist eine Zeitrafferaufnahme was in den letzten 25 Jahren passiert ist.",
"en": "The next slide I show you will be a rapid fast-forward of what's happened over the last 25 years."
}
}
```
#### iwslt2017-en-fr
- **Size of downloaded dataset files:** 27.69 MB
- **Size of the generated dataset:** 51.24 MB
- **Total amount of disk used:** 78.94 MB
An example of 'validation' looks as follows.
```
{
"translation": {
"en": "But this understates the seriousness of this particular problem because it doesn't show the thickness of the ice.",
"fr": "Mais ceci tend à amoindrir le problème parce qu'on ne voit pas l'épaisseur de la glace."
}
}
```
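The three size figures listed for each configuration appear to be related as download size plus generated size equals total disk use. The Python sketch below checks that reading against the numbers in this card. The helper `total_matches` and the tolerance value are assumptions for illustration; the tolerance exists because every figure is rounded to two decimals.

```python
from typing import Dict, Tuple

# (download MB, generated MB, reported total MB), copied from this card.
SIZES_MB: Dict[str, Tuple[float, float, float]] = {
    "iwslt2017-ar-en": (27.75, 58.74, 86.49),
    "iwslt2017-de-en": (16.76, 44.43, 61.18),
    "iwslt2017-en-ar": (29.33, 58.74, 88.07),
    "iwslt2017-en-de": (16.76, 44.43, 61.18),
    "iwslt2017-en-fr": (27.69, 51.24, 78.94),
}

# Rounded inputs can shift the sum by about 0.01, so allow a little more than that.
TOLERANCE = 0.015


def total_matches(download: float, generated: float, reported: float,
                  tol: float = TOLERANCE) -> bool:
    """Return True when download + generated agrees with the reported total."""
    return abs((download + generated) - reported) <= tol


checks = {name: total_matches(*sizes) for name, sizes in SIZES_MB.items()}
for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'mismatch'}")
```

Two of the configurations (`de-en`/`en-de` and `en-fr`) differ from the plain sum by 0.01 MB, which is what rounding each figure independently would produce.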
### Data Fields
The data fields are the same across all splits.
#### iwslt2017-ar-en
- `translation`: a multilingual `string` variable, with possible languages including `ar`, `en`.
#### iwslt2017-de-en
- `translation`: a multilingual `string` variable, with possible languages including `de`, `en`.
#### iwslt2017-en-ar
- `translation`: a multilingual `string` variable, with possible languages including `en`, `ar`.
#### iwslt2017-en-de
- `translation`: a multilingual `string` variable, with possible languages including `en`, `de`.
#### iwslt2017-en-fr
- `translation`: a multilingual `string` variable, with possible languages including `en`, `fr`.
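As a minimal sketch of how the `translation` field might be consumed, the Python snippet below pulls a (source, target) text pair out of one record. It assumes the nested layout shown in the uncropped examples above, where `translation` maps language codes to strings. The record is the `iwslt2017-de-en` training example from this card, and `to_pair` is a hypothetical helper, not a library function.

```python
from typing import Dict, Tuple

# Record copied from the iwslt2017-de-en 'train' example in this card.
record: Dict[str, Dict[str, str]] = {
    "translation": {
        "de": "Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür.",
        "en": "And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
    }
}


def to_pair(rec: Dict[str, Dict[str, str]], src: str, tgt: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for the requested language codes.

    Raises KeyError if either code is absent from the record.
    """
    translation = rec["translation"]
    missing = [code for code in (src, tgt) if code not in translation]
    if missing:
        raise KeyError(f"language code(s) not in record: {missing}")
    return translation[src], translation[tgt]


source, target = to_pair(record, "de", "en")
print(source)
print(target)
```

Because both languages sit in the same record, swapping the arguments (`to_pair(record, "en", "de")`) gives the reverse direction without touching the data.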
### Data Splits
| name |train |validation|test|
|---------------|-----:|---------:|---:|
|iwslt2017-ar-en|231713| 888|8583|
|iwslt2017-de-en|206112| 888|8079|
|iwslt2017-en-ar|231713| 888|8583|
|iwslt2017-en-de|206112| 888|8079|
|iwslt2017-en-fr|232825| 890|8597|
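The split table can be turned into totals and proportions with a few lines of arithmetic. The Python sketch below copies the counts from the table above; the helper `split_shares` is a hypothetical name used only for illustration.

```python
from typing import Dict

# Example counts copied from the Data Splits table in this card.
SPLITS: Dict[str, Dict[str, int]] = {
    "iwslt2017-ar-en": {"train": 231713, "validation": 888, "test": 8583},
    "iwslt2017-de-en": {"train": 206112, "validation": 888, "test": 8079},
    "iwslt2017-en-ar": {"train": 231713, "validation": 888, "test": 8583},
    "iwslt2017-en-de": {"train": 206112, "validation": 888, "test": 8079},
    "iwslt2017-en-fr": {"train": 232825, "validation": 890, "test": 8597},
}


def split_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the configuration's total examples."""
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}


totals = {name: sum(counts.values()) for name, counts in SPLITS.items()}
shares = {name: split_shares(counts) for name, counts in SPLITS.items()}

for name in SPLITS:
    s = shares[name]
    print(f"{name}: total={totals[name]} "
          f"train={s['train']:.2%} validation={s['validation']:.2%} test={s['test']:.2%}")
```

The counts for each language pair are identical in both directions (`ar-en` and `en-ar`, `de-en` and `en-de`), which is consistent with the same sentence pairs being exposed under two configuration names.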
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
Creative Commons BY-NC-ND
See the [TED Talks Usage Policy](https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy).
### Citation Information
```
@inproceedings{cettolo-etal-2017-overview,
title = "Overview of the {IWSLT} 2017 Evaluation Campaign",
author = {Cettolo, Mauro and
Federico, Marcello and
Bentivogli, Luisa and
Niehues, Jan and
St{\"u}ker, Sebastian and
Sudoh, Katsuhito and
Yoshino, Koichiro and
Federmann, Christian},
booktitle = "Proceedings of the 14th International Conference on Spoken Language Translation",
month = dec # " 14-15",
year = "2017",
address = "Tokyo, Japan",
publisher = "International Workshop on Spoken Language Translation",
url = "https://aclanthology.org/2017.iwslt-1.1",
pages = "2--14",
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf) and [@Narsil](https://github.com/Narsil) for adding this dataset.
Provider: maas
Created: 2025-10-29



