wmt/wmt_t2t

Hugging Face | Updated 2024-04-04 | Indexed 2024-04-20

Download link:

https://hf-mirror.com/datasets/wmt/wmt_t2t

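Overview:

WMT T2T is a bilingual dataset for machine translation, consisting mainly of German (de) and English (en) translation pairs. It extends several source datasets, including europarl_bilingual, news_commentary, opus_paracrawl, and un_multi. It falls in the 10M to 100M size category and provides training, validation, and test splits. The dataset was created to supply translation data for the Tensor2Tensor library; users can select different language pairs and subsets to build a custom dataset.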

Provider:

wmt

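Summary of original information

Dataset overview

Basic information

Name: WMT T2T
Languages: German (de), English (en)
License: unknown
Multilinguality: translation
Size category: 10M<n<100M

Dataset structure

Features:
- translation: a translation pair holding a German (de) string and an English (en) string
Splits:
- train: 4592289 examples, 1385106499 bytes
- validation: 3000 examples, 736407 bytes
- test: 3003 examples, 777326 bytes
Download size: 835031826 bytes
Dataset size: 1386620232 bytes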
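
The split figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers listed on this page; the names `SPLITS` and `summarize` are illustrative and not part of any library.

```python
# Split statistics as listed above: (number of examples, size in bytes).
SPLITS = {
    "train": (4592289, 1385106499),
    "validation": (3000, 736407),
    "test": (3003, 777326),
}
DOWNLOAD_SIZE_BYTES = 835031826
DATASET_SIZE_BYTES = 1386620232


def summarize(splits):
    """Return the total example count and total byte size across all splits."""
    total_examples = sum(n for n, _ in splits.values())
    total_bytes = sum(b for _, b in splits.values())
    return total_examples, total_bytes


total_examples, total_bytes = summarize(SPLITS)

# The per-split byte counts add up exactly to the listed dataset size.
assert total_bytes == DATASET_SIZE_BYTES

train_examples, train_bytes = SPLITS["train"]
print(f"total examples: {total_examples}")
print(f"average bytes per training example: {train_bytes / train_examples:.1f}")
print(f"download size / dataset size: {DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES:.2f}")
```

The three splits together hold 4598292 examples, and their byte counts sum to the listed dataset size. The download size is smaller than the dataset size, which is consistent with the files being stored in a compressed form.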

Data sources

Source datasets:
- europarl_bilingual
- news_commentary
- opus_paracrawl
- un_multi

Configuration

Configuration name: de-en
Default configuration: yes
Data file paths:
- train: de-en/train-*
- validation: de-en/validation-*
- test: de-en/test-*
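
As a rough illustration of how the de-en configuration maps files to splits and how a record might be consumed, here is a minimal sketch. It assumes the repository can be loaded with the Hugging Face `datasets` library under the id `wmt/wmt_t2t` (taken from the page title) and that each record carries a `translation` field keyed by language code, as listed on this page. The helper names and the example shard file name are hypothetical.

```python
from fnmatch import fnmatch

# Glob patterns from the de-en configuration listed above.
DATA_FILES = {
    "train": "de-en/train-*",
    "validation": "de-en/validation-*",
    "test": "de-en/test-*",
}


def split_for_file(path, data_files=DATA_FILES):
    """Return the split whose glob pattern matches path, or None if none does."""
    for split, pattern in data_files.items():
        if fnmatch(path, pattern):
            return split
    return None


def to_pair(record, src="de", tgt="en"):
    """Extract a (source, target) tuple from one record's translation field."""
    translation = record["translation"]
    return translation[src], translation[tgt]


def load_validation():
    """Load the validation split; needs the `datasets` package and network access."""
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset("wmt/wmt_t2t", "de-en", split="validation")


# Hypothetical shard name and record, for illustration only.
print(split_for_file("de-en/validation-00000-of-00001.parquet"))
print(to_pair({"translation": {"de": "Guten Morgen.", "en": "Good morning."}}))
```

Calling `load_validation()` is left to the reader because it downloads data. The validation split is the smallest of the three, so it is a reasonable first thing to inspect before fetching the full training set.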

Citation

@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej and Buck, Christian and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Leveling, Johannes and Monz, Christof and Pecina, Pavel and Post, Matt and Saint-Amand, Herve and Soricut, Radu and Specia, Lucia and Tamchyna, Ale{\v{s}}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}
