huy-nh-2000/wikilingua
Source: Hugging Face. Updated 2024-05-06; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/huy-nh-2000/wikilingua
Resource description:
---
dataset_info:
- config_name: arabic
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 130564388
num_examples: 20441
- name: val
num_bytes: 19181219
num_examples: 2919
- name: test
num_bytes: 37525618
num_examples: 5841
download_size: 99178827
dataset_size: 187271225
- config_name: chinese
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 58571128
num_examples: 13211
- name: val
num_bytes: 8632525
num_examples: 1886
- name: test
num_bytes: 16705476
num_examples: 3775
download_size: 51852671
dataset_size: 83909129
- config_name: czech
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 27245884
num_examples: 5033
- name: val
num_bytes: 4096962
num_examples: 718
- name: test
num_bytes: 7865102
num_examples: 1438
download_size: 43156296
dataset_size: 39207948
- config_name: dutch
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 110732449
num_examples: 21866
- name: val
num_bytes: 16335966
num_examples: 3123
- name: test
num_bytes: 31432481
num_examples: 6248
download_size: 92160306
dataset_size: 158500896
- config_name: english
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
splits:
- name: train
num_bytes: 37982521
num_examples: 16331
- name: val
num_bytes: 32179984
num_examples: 13823
- name: test
num_bytes: 37707114
num_examples: 16331
download_size: 63092163
dataset_size: 107869619
- config_name: french
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 239438189
num_examples: 44556
- name: val
num_bytes: 34600279
num_examples: 6364
- name: test
num_bytes: 68429787
num_examples: 12731
download_size: 196490341
dataset_size: 342468255
- config_name: german
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 209951261
num_examples: 40839
- name: val
num_bytes: 30301576
num_examples: 5833
- name: test
num_bytes: 59576025
num_examples: 11669
download_size: 175266735
dataset_size: 299828862
- config_name: hindi
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 58998583
num_examples: 6942
- name: val
num_bytes: 8594124
num_examples: 991
- name: test
num_bytes: 16683457
num_examples: 1984
download_size: 36144340
dataset_size: 84276164
- config_name: indonesian
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 172388409
num_examples: 33237
- name: val
num_bytes: 25084663
num_examples: 4747
- name: test
num_bytes: 48829681
num_examples: 9497
download_size: 136661413
dataset_size: 246302753
- config_name: italian
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 175094631
num_examples: 35661
- name: val
num_bytes: 25078547
num_examples: 5093
- name: test
num_bytes: 49066797
num_examples: 10189
download_size: 148140644
dataset_size: 249239975
- config_name: japanese
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 47368720
num_examples: 8853
- name: val
num_bytes: 6780358
num_examples: 1264
- name: test
num_bytes: 13278559
num_examples: 2530
download_size: 37306517
dataset_size: 67427637
- config_name: korean
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 47423941
num_examples: 8524
- name: val
num_bytes: 6850782
num_examples: 1216
- name: test
num_bytes: 13442585
num_examples: 2436
download_size: 38467491
dataset_size: 67717308
- config_name: portuguese
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 268972973
num_examples: 57159
- name: val
num_bytes: 38567774
num_examples: 8165
- name: test
num_bytes: 76012276
num_examples: 16331
download_size: 226998962
dataset_size: 383553023
- config_name: russian
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 250351067
num_examples: 37028
- name: val
num_bytes: 36146901
num_examples: 5288
- name: test
num_bytes: 70478851
num_examples: 10580
download_size: 184664294
dataset_size: 356976819
- config_name: spanish
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 394412409
num_examples: 79212
- name: val
num_bytes: 56417703
num_examples: 11316
- name: test
num_bytes: 112641904
num_examples: 22632
download_size: 327359783
dataset_size: 563472016
- config_name: thai
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 83021957
num_examples: 10325
- name: val
num_bytes: 12406174
num_examples: 1475
- name: test
num_bytes: 23563706
num_examples: 2950
download_size: 52830134
dataset_size: 118991837
- config_name: turkish
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 14436992
num_examples: 3148
- name: val
num_bytes: 2062879
num_examples: 449
- name: test
num_bytes: 3951199
num_examples: 900
download_size: 11690789
dataset_size: 20451070
- config_name: vietnamese
features:
- name: id
dtype: string
- name: split
dtype: string
- name: src_en
dtype: string
- name: tgt_en
dtype: string
- name: src_lang
dtype: string
- name: tgt_lang
dtype: string
splits:
- name: train
num_bytes: 81877587
num_examples: 13707
- name: val
num_bytes: 12182671
num_examples: 1957
- name: test
num_bytes: 23468627
num_examples: 3917
download_size: 63197927
dataset_size: 117528885
configs:
- config_name: arabic
data_files:
- split: train
path: arabic/train-*
- split: val
path: arabic/val-*
- split: test
path: arabic/test-*
- config_name: chinese
data_files:
- split: train
path: chinese/train-*
- split: val
path: chinese/val-*
- split: test
path: chinese/test-*
- config_name: czech
data_files:
- split: train
path: czech/train-*
- split: val
path: czech/val-*
- split: test
path: czech/test-*
- config_name: dutch
data_files:
- split: train
path: dutch/train-*
- split: val
path: dutch/val-*
- split: test
path: dutch/test-*
- config_name: english
data_files:
- split: train
path: english/train-*
- split: val
path: english/val-*
- split: test
path: english/test-*
- config_name: french
data_files:
- split: train
path: french/train-*
- split: val
path: french/val-*
- split: test
path: french/test-*
- config_name: german
data_files:
- split: train
path: german/train-*
- split: val
path: german/val-*
- split: test
path: german/test-*
- config_name: hindi
data_files:
- split: train
path: hindi/train-*
- split: val
path: hindi/val-*
- split: test
path: hindi/test-*
- config_name: indonesian
data_files:
- split: train
path: indonesian/train-*
- split: val
path: indonesian/val-*
- split: test
path: indonesian/test-*
- config_name: italian
data_files:
- split: train
path: italian/train-*
- split: val
path: italian/val-*
- split: test
path: italian/test-*
- config_name: japanese
data_files:
- split: train
path: japanese/train-*
- split: val
path: japanese/val-*
- split: test
path: japanese/test-*
- config_name: korean
data_files:
- split: train
path: korean/train-*
- split: val
path: korean/val-*
- split: test
path: korean/test-*
- config_name: portuguese
data_files:
- split: train
path: portuguese/train-*
- split: val
path: portuguese/val-*
- split: test
path: portuguese/test-*
- config_name: russian
data_files:
- split: train
path: russian/train-*
- split: val
path: russian/val-*
- split: test
path: russian/test-*
- config_name: spanish
data_files:
- split: train
path: spanish/train-*
- split: val
path: spanish/val-*
- split: test
path: spanish/test-*
- config_name: thai
data_files:
- split: train
path: thai/train-*
- split: val
path: thai/val-*
- split: test
path: thai/test-*
- config_name: turkish
data_files:
- split: train
path: turkish/train-*
- split: val
path: turkish/val-*
- split: test
path: turkish/test-*
- config_name: vietnamese
data_files:
- split: train
path: vietnamese/train-*
- split: val
path: vietnamese/val-*
- split: test
path: vietnamese/test-*
---
This is a multilingual translation dataset covering translation between English and other languages. Every configuration except english has six string features: id, split, src_en, tgt_en, src_lang, and tgt_lang; the english configuration has only the first four. Each configuration is divided into train, val, and test splits, and the metadata above gives the byte and example counts per split along with the download size and dataset size. There are 18 configurations: Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish, and Vietnamese.
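As a minimal sketch of how one configuration might be loaded, the snippet below uses the `load_dataset` function from the Hugging Face `datasets` library with the repository ID and configuration names taken from the card. The helper names `expected_columns` and `load_config` are hypothetical and exist only for illustration; the loader needs the `datasets` package and network access, so it is imported lazily and never called at import time.

```python
from typing import List

# Repository ID and split names as they appear in the card metadata.
REPO_ID = "huy-nh-2000/wikilingua"
SPLITS = ("train", "val", "test")

# Columns shared by every configuration according to the card.
BASE_COLUMNS = ["id", "split", "src_en", "tgt_en"]


def expected_columns(config_name: str) -> List[str]:
    """Return the columns the card lists for a configuration (hypothetical helper)."""
    if config_name == "english":
        # The english configuration lists no src_lang / tgt_lang features.
        return list(BASE_COLUMNS)
    return BASE_COLUMNS + ["src_lang", "tgt_lang"]


def load_config(config_name: str):
    """Download one configuration (hypothetical helper).

    Requires the `datasets` package and network access, so the import
    is done here rather than at module level.
    """
    from datasets import load_dataset

    # With no `split` argument, load_dataset returns all available splits.
    return load_dataset(REPO_ID, config_name)
```

Assuming the repository is reachable, `load_config("turkish")` would be expected to return an object holding the train, val, and test splits, whose column names can then be compared against `expected_columns("turkish")`.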
Provider:
huy-nh-2000
Summary of original information
Dataset overview
Arabic dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 20441 examples, 130564388 bytes
  - Validation set: 2919 examples, 19181219 bytes
  - Test set: 5841 examples, 37525618 bytes
- Download size: 99178827 bytes
- Dataset size: 187271225 bytes
Chinese dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 13211 examples, 58571128 bytes
  - Validation set: 1886 examples, 8632525 bytes
  - Test set: 3775 examples, 16705476 bytes
- Download size: 51852671 bytes
- Dataset size: 83909129 bytes
Czech dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 5033 examples, 27245884 bytes
  - Validation set: 718 examples, 4096962 bytes
  - Test set: 1438 examples, 7865102 bytes
- Download size: 43156296 bytes
- Dataset size: 39207948 bytes
Dutch dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 21866 examples, 110732449 bytes
  - Validation set: 3123 examples, 16335966 bytes
  - Test set: 6248 examples, 31432481 bytes
- Download size: 92160306 bytes
- Dataset size: 158500896 bytes
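English dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
- Splits:
  - Training set: 16331 examples, 37982521 bytes
  - Validation set: 13823 examples, 32179984 bytes
  - Test set: 16331 examples, 37707114 bytes
- Download size: 63092163 bytes
- Dataset size: 107869619 bytes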
French dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 44556 examples, 239438189 bytes
  - Validation set: 6364 examples, 34600279 bytes
  - Test set: 12731 examples, 68429787 bytes
- Download size: 196490341 bytes
- Dataset size: 342468255 bytes
German dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 40839 examples, 209951261 bytes
  - Validation set: 5833 examples, 30301576 bytes
  - Test set: 11669 examples, 59576025 bytes
- Download size: 175266735 bytes
- Dataset size: 299828862 bytes
Hindi dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 6942 examples, 58998583 bytes
  - Validation set: 991 examples, 8594124 bytes
  - Test set: 1984 examples, 16683457 bytes
- Download size: 36144340 bytes
- Dataset size: 84276164 bytes
Indonesian dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 33237 examples, 172388409 bytes
  - Validation set: 4747 examples, 25084663 bytes
  - Test set: 9497 examples, 48829681 bytes
- Download size: 136661413 bytes
- Dataset size: 246302753 bytes
Italian dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 35661 examples, 175094631 bytes
  - Validation set: 5093 examples, 25078547 bytes
  - Test set: 10189 examples, 49066797 bytes
- Download size: 148140644 bytes
- Dataset size: 249239975 bytes
Japanese dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 8853 examples, 47368720 bytes
  - Validation set: 1264 examples, 6780358 bytes
  - Test set: 2530 examples, 13278559 bytes
- Download size: 37306517 bytes
- Dataset size: 67427637 bytes
Korean dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 8524 examples, 47423941 bytes
  - Validation set: 1216 examples, 6850782 bytes
  - Test set: 2436 examples, 13442585 bytes
- Download size: 38467491 bytes
- Dataset size: 67717308 bytes
Portuguese dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 57159 examples, 268972973 bytes
  - Validation set: 8165 examples, 38567774 bytes
  - Test set: 16331 examples, 76012276 bytes
- Download size: 226998962 bytes
- Dataset size: 383553023 bytes
Russian dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 37028 examples, 250351067 bytes
  - Validation set: 5288 examples, 36146901 bytes
  - Test set: 10580 examples, 70478851 bytes
- Download size: 184664294 bytes
- Dataset size: 356976819 bytes
Spanish dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 79212 examples, 394412409 bytes
  - Validation set: 11316 examples, 56417703 bytes
  - Test set: 22632 examples, 112641904 bytes
- Download size: 327359783 bytes
- Dataset size: 563472016 bytes
Thai dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 10325 examples, 83021957 bytes
  - Validation set: 1475 examples, 12406174 bytes
  - Test set: 2950 examples, 23563706 bytes
- Download size: 52830134 bytes
- Dataset size: 118991837 bytes
Turkish dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 3148 examples, 14436992 bytes
  - Validation set: 449 examples, 2062879 bytes
  - Test set: 900 examples, 3951199 bytes
- Download size: 11690789 bytes
- Dataset size: 20451070 bytes
Vietnamese dataset
- Features:
  - id: string
  - split: string
  - src_en: string
  - tgt_en: string
  - src_lang: string
  - tgt_lang: string
- Splits:
  - Training set: 13707 examples, 81877587 bytes
  - Validation set: 1957 examples, 12182671 bytes
  - Test set: 3917 examples, 23468627 bytes
- Download size: 63197927 bytes
- Dataset size: 117528885 bytes
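
The per-split example counts listed above can be turned into split proportions with a little arithmetic. The sketch below does this for three configurations, with counts copied from the listing; the `SPLIT_COUNTS` table and the `split_fractions` helper are hypothetical names introduced only for this illustration.

```python
from typing import Dict

# Example counts copied from the listing above (three configurations only).
SPLIT_COUNTS: Dict[str, Dict[str, int]] = {
    "arabic": {"train": 20441, "val": 2919, "test": 5841},
    "turkish": {"train": 3148, "val": 449, "test": 900},
    "english": {"train": 16331, "val": 13823, "test": 16331},
}


def split_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the configuration's total examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLIT_COUNTS.items():
    fractions = split_fractions(counts)
    summary = ", ".join(f"{name}={frac:.2f}" for name, frac in fractions.items())
    print(f"{config}: {summary}")
```

For the Arabic and Turkish counts, this works out to roughly 0.70 / 0.10 / 0.20 for train / val / test. The English counts come out differently, at about 0.35 / 0.30 / 0.35.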



