wmt/wmt17

Source: Hugging Face. Updated 2024-04-03. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/wmt/wmt17
Resource overview:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- cs
- de
- en
- fi
- lv
- ru
- tr
- zh
license:
- unknown
multilinguality:
- translation
size_categories:
- 10M<n<100M
source_datasets:
- extended|europarl_bilingual
- extended|news_commentary
- extended|setimes
- extended|un_multi
task_categories:
- translation
task_ids: []
pretty_name: WMT17
dataset_info:
- config_name: cs-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [cs, en]
  splits:
  - {name: train, num_bytes: 300697615, num_examples: 1018291}
  - {name: validation, num_bytes: 707862, num_examples: 2999}
  - {name: test, num_bytes: 674422, num_examples: 3005}
  download_size: 181690407
  dataset_size: 302079899
- config_name: de-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [de, en]
  splits:
  - {name: train, num_bytes: 1715532715, num_examples: 5906184}
  - {name: validation, num_bytes: 735508, num_examples: 2999}
  - {name: test, num_bytes: 729511, num_examples: 3004}
  download_size: 1011327465
  dataset_size: 1716997734
- config_name: fi-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [fi, en]
  splits:
  - {name: train, num_bytes: 743854397, num_examples: 2656542}
  - {name: validation, num_bytes: 1410507, num_examples: 6000}
  - {name: test, num_bytes: 1388820, num_examples: 6004}
  download_size: 423069132
  dataset_size: 746653724
- config_name: lv-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [lv, en]
  splits:
  - {name: train, num_bytes: 517416244, num_examples: 3567528}
  - {name: validation, num_bytes: 544596, num_examples: 2003}
  - {name: test, num_bytes: 530466, num_examples: 2001}
  download_size: 245201883
  dataset_size: 518491306
- config_name: ru-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [ru, en]
  splits:
  - {name: train, num_bytes: 11000055690, num_examples: 24782720}
  - {name: validation, num_bytes: 1050669, num_examples: 2998}
  - {name: test, num_bytes: 1040187, num_examples: 3001}
  download_size: 4866529051
  dataset_size: 11002146546
- config_name: tr-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [tr, en]
  splits:
  - {name: train, num_bytes: 60416449, num_examples: 205756}
  - {name: validation, num_bytes: 732428, num_examples: 3000}
  - {name: test, num_bytes: 752765, num_examples: 3007}
  download_size: 37706176
  dataset_size: 61901642
- config_name: zh-en
  features:
  - name: translation
    dtype:
      translation:
        languages: [zh, en]
  splits:
  - {name: train, num_bytes: 6336104073, num_examples: 25134743}
  - {name: validation, num_bytes: 589583, num_examples: 2002}
  - {name: test, num_bytes: 540339, num_examples: 2001}
  download_size: 3576239952
  dataset_size: 6337233995
configs:
- config_name: cs-en
  data_files:
  - {split: train, path: "cs-en/train-*"}
  - {split: validation, path: "cs-en/validation-*"}
  - {split: test, path: "cs-en/test-*"}
- config_name: de-en
  data_files:
  - {split: train, path: "de-en/train-*"}
  - {split: validation, path: "de-en/validation-*"}
  - {split: test, path: "de-en/test-*"}
- config_name: fi-en
  data_files:
  - {split: train, path: "fi-en/train-*"}
  - {split: validation, path: "fi-en/validation-*"}
  - {split: test, path: "fi-en/test-*"}
- config_name: lv-en
  data_files:
  - {split: train, path: "lv-en/train-*"}
  - {split: validation, path: "lv-en/validation-*"}
  - {split: test, path: "lv-en/test-*"}
- config_name: ru-en
  data_files:
  - {split: train, path: "ru-en/train-*"}
  - {split: validation, path: "ru-en/validation-*"}
  - {split: test, path: "ru-en/test-*"}
- config_name: tr-en
  data_files:
  - {split: train, path: "tr-en/train-*"}
  - {split: validation, path: "tr-en/validation-*"}
  - {split: test, path: "tr-en/test-*"}
- config_name: zh-en
  data_files:
  - {split: train, path: "zh-en/train-*"}
  - {split: validation, path: "zh-en/validation-*"}
  - {split: test, path: "zh-en/test-*"}
---

# Dataset Card for "wmt17"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [http://www.statmt.org/wmt17/translation-task.html](http://www.statmt.org/wmt17/translation-task.html)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.78 GB
- **Size of the generated dataset:** 302.09 MB
- **Total amount of disk used:** 2.09 GB

### Dataset Summary

<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
  <p><b>Warning:</b> There are issues with the Common Crawl corpus data (<a href="https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz">training-parallel-commoncrawl.tgz</a>):</p>
  <ul>
    <li>Non-English files contain many English sentences.</li>
    <li>Their "parallel" sentences in English are not aligned: they are uncorrelated with their counterpart.</li>
  </ul>
  <p>We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. Their rationale pertains to the expectation that such data has been superseded, primarily by CCMatrix, and to some extent, by ParaCrawl datasets.</p>
</div>

Translation dataset based on the data from statmt.org.

Versions exist for different years using a combination of data sources. The base `wmt` allows you to create a custom dataset by choosing your own data/language pair. This can be done as follows:

```python
import datasets  # needed for the datasets.Split constants used below
from datasets import inspect_dataset, load_dataset_builder

# Copy the dataset's loading scripts to a local directory.
inspect_dataset("wmt17", "path/to/scripts")

# Build a custom dataset from a chosen language pair and subsets.
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version
ds = builder.as_streaming_dataset()
```

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### cs-en

- **Size of downloaded dataset files:** 1.78 GB
- **Size of the generated dataset:** 302.09 MB
- **Total amount of disk used:** 2.09 GB

An example of 'train' looks as follows.
```

```

### Data Fields

The data fields are the same among all splits.

#### cs-en
- `translation`: a multilingual `string` variable, with possible languages including `cs`, `en`.

### Data Splits

|name | train |validation|test|
|-----|------:|---------:|---:|
|cs-en|1018291|      2999|3005|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@InProceedings{bojar-EtAl:2017:WMT1,
  author    = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huang, Shujian and Huck, Matthias and Koehn, Philipp and Liu, Qun and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Rubino, Raphael and Specia, Lucia and Turchi, Marco},
  title     = {Findings of the 2017 Conference on Machine Translation (WMT17)},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {169--214},
  url       = {http://www.aclweb.org/anthology/W17-4717}
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.

Provider:
wmt
Summary of original information

Dataset overview

Basic information

  • Name: WMT17
  • Languages: cs, de, en, fi, lv, ru, tr, zh
  • License: unknown
  • Multilinguality: translation
  • Size: 10M<n<100M

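Dataset structure

  • Task category: translation
  • Configurations: seven language pairs (cs-en, de-en, fi-en, lv-en, ru-en, tr-en, zh-en)
  • Features: each configuration has a single feature named translation, of string type; the language pair is specified per configuration
  • Data splits: each configuration has train, validation, and test splits, detailed below:

| Config | Train bytes | Train examples | Validation bytes | Validation examples | Test bytes | Test examples |
|--------|------------:|---------------:|-----------------:|--------------------:|-----------:|--------------:|
| cs-en | 300697615 | 1018291 | 707862 | 2999 | 674422 | 3005 |
| de-en | 1715532715 | 5906184 | 735508 | 2999 | 729511 | 3004 |
| fi-en | 743854397 | 2656542 | 1410507 | 6000 | 1388820 | 6004 |
| lv-en | 517416244 | 3567528 | 544596 | 2003 | 530466 | 2001 |
| ru-en | 11000055690 | 24782720 | 1050669 | 2998 | 1040187 | 3001 |
| tr-en | 60416449 | 205756 | 732428 | 3000 | 752765 | 3007 |
| zh-en | 6336104073 | 25134743 | 589583 | 2002 | 540339 | 2001 |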
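
As a quick arithmetic check on the table above, the sketch below sums the training-set example counts and estimates the average size of a training example. The numbers are copied from the table; the function names are illustrative only, and the reading of the size category as a count over all configurations is an assumption.

```python
# Training-set sizes copied from the split table above.
TRAIN_EXAMPLES = {
    "cs-en": 1_018_291,
    "de-en": 5_906_184,
    "fi-en": 2_656_542,
    "lv-en": 3_567_528,
    "ru-en": 24_782_720,
    "tr-en": 205_756,
    "zh-en": 25_134_743,
}
TRAIN_BYTES = {
    "cs-en": 300_697_615,
    "de-en": 1_715_532_715,
    "fi-en": 743_854_397,
    "lv-en": 517_416_244,
    "ru-en": 11_000_055_690,
    "tr-en": 60_416_449,
    "zh-en": 6_336_104_073,
}


def total_train_examples(counts: dict) -> int:
    """Sum the training examples over all language-pair configurations."""
    return sum(counts.values())


def bytes_per_example(config: str) -> float:
    """Average number of bytes per training example for one configuration."""
    return TRAIN_BYTES[config] / TRAIN_EXAMPLES[config]


total = total_train_examples(TRAIN_EXAMPLES)
# Assumption: the size category 10M<n<100M counts examples across all configurations.
in_declared_bucket = 10_000_000 < total < 100_000_000
```

The total comes to 63,271,764 training examples, which falls inside the declared 10M<n<100M size category. The validation and test splits add only a few thousand examples each, so they do not change the bucket.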

Dataset creation

  • Source data: extended from several datasets, including europarl_bilingual, news_commentary, setimes, and un_multi
  • Annotations: none
  • Personal and sensitive information: not stated

Considerations for use

  • Social impact: not stated
  • Discussion of biases: not stated
  • Other known limitations: not stated

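Additional information

  • Dataset curators: not stated
  • Licensing information: unknown
  • Citation information: a BibTeX entry is provided for academic citation
  • Contributors: thanks to @patrickvonplaten and @thomwolf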
Collected summary
Dataset introduction
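Construction
The WMT17 dataset is built from data provided by statmt.org and covers translation tasks for several language pairs. It combines multiple sources, including Europarl, News Commentary, SETimes, and UN Multi, into a large-scale multilingual translation corpus. Each language-pair configuration has train, validation, and test splits, so every pair can be trained and evaluated on fixed data.
Features
A notable feature of WMT17 is its language coverage: Czech, German, Finnish, Latvian, Russian, Turkish, and Chinese, each paired with English. The training sets are large, ranging from about 206 thousand examples (tr-en) to about 25.1 million (zh-en), which gives machine translation models ample training material.
Usage
Users can load a specific language-pair configuration with the Hugging Face datasets library and prepare it for model training and evaluation in a few lines of Python. Each record holds a single translation field keyed by language code, so the data can be fed directly into machine translation pipelines.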
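
The loading step described above can be sketched as follows. This is a minimal example, not the catalog's or the dataset authors' own code. It assumes the `datasets` library is installed, that the repository id `wmt/wmt17` from the download link is reachable, and that each record has the shape `{"translation": {"cs": "...", "en": "..."}}`. The helper names `load_wmt17` and `iter_pairs` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def load_wmt17(config: str = "cs-en", split: str = "validation", streaming: bool = False):
    """Load one WMT17 language pair from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that iter_pairs below can be used without the library.
    from datasets import load_dataset

    return load_dataset("wmt/wmt17", config, split=split, streaming=streaming)


def iter_pairs(
    records: Iterable[Dict[str, Dict[str, str]]], src: str, tgt: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from records with a `translation` field."""
    for record in records:
        pair = record["translation"]
        yield pair[src], pair[tgt]


# Example usage (downloads data, so it is left as a comment):
#   ds = load_wmt17("cs-en", split="validation")
#   for cs_sentence, en_sentence in iter_pairs(ds, "cs", "en"):
#       ...
```

Passing `streaming=True` avoids downloading a whole configuration up front, which may matter for the larger pairs such as ru-en and zh-en.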
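Background and challenges
Background overview
WMT17 is a large-scale multilingual translation dataset released in 2017 for the Conference on Machine Translation (WMT). The authors of the accompanying findings paper include Ondřej Bojar, Rajen Chatterjee, and Christian Federmann. The corpus combines several sources, such as Europarl, News Commentary, SETimes, and UN Multi, and covers multiple language pairs. The core research question behind WMT17 is how to improve machine translation systems, in particular translation accuracy and fluency across languages. The dataset gives researchers a standardized benchmark for comparing and improving translation models.
Current challenges
Building WMT17 involved several challenges. First, the variety of sources leads to uneven data quality. In particular, the Common Crawl data contains many misaligned sentence pairs, which makes cleaning and preprocessing harder. Second, multilingual translation is difficult in itself, because differences in grammar and cultural context between languages complicate model training. Third, the dataset is large and spans many language pairs, so processing and storing it requires efficient tooling. Finally, updating and maintaining the dataset is an ongoing challenge, and keeping it current and useful as new data sources appear remains an open problem.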
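
The cleaning step mentioned above can be illustrated with a simple length-ratio filter. This is a common preprocessing heuristic, not a method described by the dataset card or the WMT organizers, and it cannot detect pairs that have similar lengths but unrelated meaning. The function names and the threshold `max_ratio=3.0` are assumptions for illustration. Character-length ratios also behave differently for pairs such as zh-en, so the threshold would need tuning per language pair.

```python
from typing import Iterable, List, Tuple


def plausible_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Heuristic check that a sentence pair could be a translation pair.

    Rejects empty sides, identical text on both sides (often untranslated
    content), and pairs whose character lengths differ by more than max_ratio.
    """
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False
    if src == tgt:
        return False
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    return longer / shorter <= max_ratio


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], max_ratio: float = 3.0
) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass the plausibility heuristic."""
    return [(s, t) for s, t in pairs if plausible_pair(s, t, max_ratio)]
```

A filter like this only removes the most obvious noise. Catching pairs that are fluent but unrelated would need a stronger signal, for example language identification or a cross-lingual similarity score.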
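Common scenarios
Typical use cases
In machine translation, WMT17 is a standard resource because of its large body of aligned multilingual text. It is widely used to train and evaluate machine translation models, and it also appears in cross-lingual information retrieval and multilingual text processing tasks. As a parallel corpus, it gives researchers and developers a basis for improving the accuracy and efficiency of translation systems.
Derived and related work
WMT17 has served as the basis for a range of follow-up research. Researchers have used it to develop new translation models and improve translation quality. It has also motivated work on methods for processing and analyzing multilingual data. This derived work has contributed to academic research and to practical applications.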
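Recent research on the dataset
Latest research directions
In machine translation, recent work involving WMT17 concentrates on optimizing and extending multilingual translation models. As demand for cross-language communication grows, researchers aim to improve the accuracy and efficiency of translation systems. Transformer-based multilingual pretrained models such as mBERT and XLM-R share parameters across languages, and this kind of multilingual pretraining is reported to improve multilingual translation performance. Combining data augmentation with pretrained models is also reported to help on low-resource language pairs.
The content above was collected and summarized by 遇见数据集.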