iwslt2017

ModelScope community; updated 2025-12-05, indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/IWSLT/iwslt2017
Resource description:
# Dataset Card for IWSLT 2017

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://sites.google.com/site/iwsltevaluation2017/TED-tasks](https://sites.google.com/site/iwsltevaluation2017/TED-tasks)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Overview of the IWSLT 2017 Evaluation Campaign](https://aclanthology.org/2017.iwslt-1.1/)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 4.24 GB
- **Size of the generated dataset:** 1.14 GB
- **Total amount of disk used:** 5.38 GB

### Dataset Summary

The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system across all directions including English, German, Dutch, Italian and Romanian.
As an unofficial task, conventional bilingual text translation is offered between English and Arabic, French, Japanese, Chinese, German and Korean.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### iwslt2017-ar-en

- **Size of downloaded dataset files:** 27.75 MB
- **Size of the generated dataset:** 58.74 MB
- **Total amount of disk used:** 86.49 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "translation": "{\"ar\": \"لقد طرت في \\\"القوات الجوية \\\" لمدة ثمان سنوات. والآن أجد نفسي مضطرا لخلع حذائي قبل صعود الطائرة!\", \"en\": \"I flew on Air ..."
}
```

#### iwslt2017-de-en

- **Size of downloaded dataset files:** 16.76 MB
- **Size of the generated dataset:** 44.43 MB
- **Total amount of disk used:** 61.18 MB

An example of 'train' looks as follows.

```
{
    "translation": {
        "de": "Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür.",
        "en": "And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful."
    }
}
```

#### iwslt2017-en-ar

- **Size of downloaded dataset files:** 29.33 MB
- **Size of the generated dataset:** 58.74 MB
- **Total amount of disk used:** 88.07 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "translation": "{\"ar\": \"لقد طرت في \\\"القوات الجوية \\\" لمدة ثمان سنوات. والآن أجد نفسي مضطرا لخلع حذائي قبل صعود الطائرة!\", \"en\": \"I flew on Air ..."
}
```

#### iwslt2017-en-de

- **Size of downloaded dataset files:** 16.76 MB
- **Size of the generated dataset:** 44.43 MB
- **Total amount of disk used:** 61.18 MB

An example of 'validation' looks as follows.
```
{
    "translation": {
        "de": "Die nächste Folie, die ich Ihnen zeige, ist eine Zeitrafferaufnahme was in den letzten 25 Jahren passiert ist.",
        "en": "The next slide I show you will be a rapid fast-forward of what's happened over the last 25 years."
    }
}
```

#### iwslt2017-en-fr

- **Size of downloaded dataset files:** 27.69 MB
- **Size of the generated dataset:** 51.24 MB
- **Total amount of disk used:** 78.94 MB

An example of 'validation' looks as follows.

```
{
    "translation": {
        "en": "But this understates the seriousness of this particular problem because it doesn't show the thickness of the ice.",
        "fr": "Mais ceci tend à amoindrir le problème parce qu'on ne voit pas l'épaisseur de la glace."
    }
}
```

### Data Fields

The data fields are the same among all splits.

#### iwslt2017-ar-en

- `translation`: a multilingual `string` variable, with possible languages including `ar`, `en`.

#### iwslt2017-de-en

- `translation`: a multilingual `string` variable, with possible languages including `de`, `en`.

#### iwslt2017-en-ar

- `translation`: a multilingual `string` variable, with possible languages including `en`, `ar`.

#### iwslt2017-en-de

- `translation`: a multilingual `string` variable, with possible languages including `en`, `de`.

#### iwslt2017-en-fr

- `translation`: a multilingual `string` variable, with possible languages including `en`, `fr`.
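
The uncropped examples above show `translation` as a mapping from language code to sentence. A minimal sketch of how one record might be split into a source/target pair follows. The helper name `split_pair` is hypothetical, and reading the translation direction from the configuration name (`iwslt2017-<source>-<target>`) is an assumption based on the naming pattern, not something the card states.

```python
from typing import Dict, Tuple

# Record copied from the iwslt2017-de-en 'train' example in this card.
example = {
    "translation": {
        "de": "Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür.",
        "en": "And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
    }
}


def split_pair(record: Dict[str, Dict[str, str]], config: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for one record.

    Assumption: the configuration name has the form
    "iwslt2017-<source>-<target>", e.g. "iwslt2017-de-en".
    """
    parts = config.split("-")
    if len(parts) != 3 or parts[0] != "iwslt2017":
        raise ValueError(f"unexpected configuration name: {config!r}")
    _, src, tgt = parts
    translation = record["translation"]
    # A KeyError here means the record lacks one of the two languages.
    return translation[src], translation[tgt]


source, target = split_pair(example, "iwslt2017-de-en")
print(source)
print(target)
```

Since each listed language pair appears in both directions with identical split sizes, the same record can presumably serve either direction by swapping the configuration name.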

### Data Splits

| name            |  train | validation | test |
|-----------------|-------:|-----------:|-----:|
| iwslt2017-ar-en | 231713 |        888 | 8583 |
| iwslt2017-de-en | 206112 |        888 | 8079 |
| iwslt2017-en-ar | 231713 |        888 | 8583 |
| iwslt2017-en-de | 206112 |        888 | 8079 |
| iwslt2017-en-fr | 232825 |        890 | 8597 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

Creative Commons BY-NC-ND

See the [TED Talks Usage Policy](https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy).

### Citation Information

```
@inproceedings{cettolo-etal-2017-overview,
    title = "Overview of the {IWSLT} 2017 Evaluation Campaign",
    author = {Cettolo, Mauro and
      Federico, Marcello and
      Bentivogli, Luisa and
      Niehues, Jan and
      St{\"u}ker, Sebastian and
      Sudoh, Katsuhito and
      Yoshino, Koichiro and
      Federmann, Christian},
    booktitle = "Proceedings of the 14th International Conference on Spoken Language Translation",
    month = dec # " 14-15",
    year = "2017",
    address = "Tokyo, Japan",
    publisher = "International Workshop on Spoken Language Translation",
    url = "https://aclanthology.org/2017.iwslt-1.1",
    pages = "2--14",
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@Narsil](https://github.com/Narsil) for adding this dataset.

Provider:
maas
Created:
2025-10-29