CohereForAI/aya_collection_language_split
收藏Hugging Face2024-06-28 更新2024-05-25 收录
下载链接:
https://hf-mirror.com/datasets/CohereForAI/aya_collection_language_split
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- ace
- afr
- amh
- ara
- aze
- ban
- bbc
- bel
- bem
- ben
- bjn
- bul
- cat
- ceb
- ces
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fil
- fin
- fon
- fra
- gla
- gle
- glg
- guj
- hat
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kan
- kas
- kat
- kau
- kaz
- khm
- kin
- kir
- kor
- kur
- lao
- lav
- lij
- lit
- ltz
- mad
- mal
- man
- mar
- min
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nij
- nld
- nor
- nso
- nya
- pan
- pes
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- srp
- sun
- swa
- swe
- tam
- taq
- tel
- tgk
- tha
- tur
- twi
- ukr
- urd
- uzb
- vie
- wol
- xho
- yid
- yor
- zho
- zul
license: apache-2.0
dataset_info:
- config_name: achinese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4777872484
num_examples: 7145730
- name: validation
num_bytes: 399703157
num_examples: 545944
- name: test
num_bytes: 438143574
num_examples: 550610
download_size: 2233825990
dataset_size: 5615719215
- config_name: afrikaans
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1894924665
num_examples: 3577285
- name: validation
num_bytes: 156737548
num_examples: 273427
- name: test
num_bytes: 172092631
num_examples: 275538
download_size: 1034975544
dataset_size: 2223754844
- config_name: algerian_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 1123844
num_examples: 3302
- name: validation
num_bytes: 282474
num_examples: 828
- name: test
num_bytes: 660436
num_examples: 1916
download_size: 942250
dataset_size: 2066754
- config_name: amharic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2867327168
num_examples: 3589993
- name: validation
num_bytes: 235817916
num_examples: 276505
- name: test
num_bytes: 265219081
num_examples: 280178
download_size: 1340859845
dataset_size: 3368364165
- config_name: armenian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3092321567
num_examples: 3576382
- name: validation
num_bytes: 256070205
num_examples: 272872
- name: test
num_bytes: 287127303
num_examples: 277968
download_size: 1396875621
dataset_size: 3635519075
- config_name: balinese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 335222
num_examples: 1000
- name: validation
num_bytes: 67729
num_examples: 200
- name: test
num_bytes: 267606
num_examples: 800
download_size: 261161
dataset_size: 670557
- config_name: banjar
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4896784925
num_examples: 7145730
- name: validation
num_bytes: 407788290
num_examples: 545944
- name: test
num_bytes: 448059987
num_examples: 550610
download_size: 2315045966
dataset_size: 5752633202
- config_name: basque
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1741927285
num_examples: 3573304
- name: validation
num_bytes: 146422247
num_examples: 272872
- name: test
num_bytes: 160617999
num_examples: 274905
download_size: 955378830
dataset_size: 2048967531
- config_name: belarusian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2964962848
num_examples: 3589912
- name: validation
num_bytes: 247498405
num_examples: 274387
- name: test
num_bytes: 272080740
num_examples: 277116
download_size: 1448894856
dataset_size: 3484541993
- config_name: bemba
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 37604
num_examples: 231
- name: validation
num_bytes: 38827
num_examples: 233
- name: test
num_bytes: 50320
num_examples: 312
download_size: 59925
dataset_size: 126751
- config_name: bengali
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4321318392
num_examples: 3601287
- name: validation
num_bytes: 366014588
num_examples: 274546
- name: test
num_bytes: 409983047
num_examples: 276504
download_size: 1609211542
dataset_size: 5097316027
- config_name: bulgarian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2976574500
num_examples: 3602878
- name: validation
num_bytes: 252696998
num_examples: 276385
- name: test
num_bytes: 277603347
num_examples: 278601
download_size: 1396874342
dataset_size: 3506874845
- config_name: burmese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4395135264
num_examples: 3572837
- name: validation
num_bytes: 371771210
num_examples: 272872
- name: test
num_bytes: 415414624
num_examples: 274905
download_size: 1584019542
dataset_size: 5182321098
- config_name: cantonese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1514163853
num_examples: 3572365
- name: validation
num_bytes: 127080943
num_examples: 272872
- name: test
num_bytes: 139900667
num_examples: 274905
download_size: 926620800
dataset_size: 1781145463
- config_name: catalan
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2003489637
num_examples: 3625537
- name: validation
num_bytes: 167708237
num_examples: 280507
- name: test
num_bytes: 182829005
num_examples: 280998
download_size: 1098892975
dataset_size: 2354026879
- config_name: cebuano
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2114801493
num_examples: 3573092
- name: validation
num_bytes: 177057927
num_examples: 272872
- name: test
num_bytes: 194480788
num_examples: 274905
download_size: 1079929756
dataset_size: 2486340208
- config_name: central_kanuri
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 5293400941
num_examples: 7144730
- name: validation
num_bytes: 443645193
num_examples: 545744
- name: test
num_bytes: 481978035
num_examples: 549810
download_size: 2530333511
dataset_size: 6219024169
- config_name: central_khmer
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4308880945
num_examples: 3572365
- name: validation
num_bytes: 361390828
num_examples: 272872
- name: test
num_bytes: 402035117
num_examples: 274905
download_size: 1671833499
dataset_size: 5072306890
- config_name: central_kurdish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2989432145
num_examples: 3572444
- name: validation
num_bytes: 251416139
num_examples: 272872
- name: test
num_bytes: 279251698
num_examples: 274905
download_size: 1345601761
dataset_size: 3520099982
- config_name: chinese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 48479164
num_examples: 58941
- name: validation
num_bytes: 6094381
num_examples: 7397
- name: test
num_bytes: 7564241
num_examples: 8634
download_size: 33906872
dataset_size: 62137786
- config_name: croatian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 7496901
num_examples: 6913
- name: validation
num_bytes: 1048919
num_examples: 959
- name: test
num_bytes: 1344439
num_examples: 1135
download_size: 1732429
dataset_size: 9890259
- config_name: czech
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2252022647
num_examples: 3719214
- name: validation
num_bytes: 167604939
num_examples: 286371
- name: test
num_bytes: 210435954
num_examples: 294161
download_size: 1384567896
dataset_size: 2630063540
- config_name: danish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1849189467
num_examples: 3601900
- name: validation
num_bytes: 154056275
num_examples: 276495
- name: test
num_bytes: 167876603
num_examples: 278154
download_size: 1027097230
dataset_size: 2171122345
- config_name: dutch
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2030569893
num_examples: 3736938
- name: validation
num_bytes: 170802711
num_examples: 289696
- name: test
num_bytes: 224723818
num_examples: 315422
download_size: 1155491095
dataset_size: 2426096422
- config_name: eastern_yiddish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3438789221
num_examples: 3572365
- name: validation
num_bytes: 291234897
num_examples: 272872
- name: test
num_bytes: 320685628
num_examples: 274905
download_size: 1541036441
dataset_size: 4050709746
- config_name: egyptian_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2483158544
num_examples: 3572894
- name: validation
num_bytes: 205813835
num_examples: 272872
- name: test
num_bytes: 228781109
num_examples: 274905
download_size: 1206386937
dataset_size: 2917753488
- config_name: english
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: validation
num_bytes: 1128193367
num_examples: 1566890
- name: test
num_bytes: 1096821940
num_examples: 1581136
- name: train
num_bytes: 12429894980
num_examples: 14693823
download_size: 7387226092
dataset_size: 14654910287
- config_name: esperanto
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1842012169
num_examples: 3572365
- name: validation
num_bytes: 154223679
num_examples: 272872
- name: test
num_bytes: 168686341
num_examples: 274905
download_size: 1016436272
dataset_size: 2164922189
- config_name: estonian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1742541505
num_examples: 3572365
- name: validation
num_bytes: 146624244
num_examples: 272872
- name: test
num_bytes: 160222146
num_examples: 274905
download_size: 1005176026
dataset_size: 2049387895
- config_name: filipino
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 535647
num_examples: 1241
- name: test
num_bytes: 214434
num_examples: 220
download_size: 301691
dataset_size: 750081
- config_name: finnish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1953535763
num_examples: 3939941
- name: validation
num_bytes: 170050074
num_examples: 317866
- name: test
num_bytes: 185236179
num_examples: 320972
download_size: 1102957613
dataset_size: 2308822016
- config_name: fon
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 37822
num_examples: 250
- name: validation
num_bytes: 39298
num_examples: 256
- name: test
num_bytes: 49988
num_examples: 339
download_size: 58525
dataset_size: 127108
- config_name: french
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4221754220
num_examples: 4285094
- name: validation
num_bytes: 236528205
num_examples: 327863
- name: test
num_bytes: 267616539
num_examples: 344127
download_size: 2466958656
dataset_size: 4725898964
- config_name: galician
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1910420859
num_examples: 3572365
- name: validation
num_bytes: 158236862
num_examples: 272872
- name: test
num_bytes: 172889464
num_examples: 274905
download_size: 1045134255
dataset_size: 2241547185
- config_name: georgian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4050312890
num_examples: 3572365
- name: validation
num_bytes: 336208596
num_examples: 272872
- name: test
num_bytes: 377215919
num_examples: 274905
download_size: 1532379645
dataset_size: 4763737405
- config_name: german
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4835849859
num_examples: 4689989
- name: validation
num_bytes: 271507778
num_examples: 367838
- name: test
num_bytes: 309636800
num_examples: 389278
download_size: 2916001621
dataset_size: 5416994437
- config_name: greek
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3279139380
num_examples: 3606249
- name: validation
num_bytes: 277100008
num_examples: 275776
- name: test
num_bytes: 305255607
num_examples: 279031
download_size: 1564810277
dataset_size: 3861494995
- config_name: gujarati
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4071303520
num_examples: 3578511
- name: validation
num_bytes: 343022345
num_examples: 272872
- name: test
num_bytes: 383553796
num_examples: 274905
download_size: 1574047934
dataset_size: 4797879661
- config_name: haitian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1798238955
num_examples: 3572471
- name: validation
num_bytes: 148501230
num_examples: 272872
- name: test
num_bytes: 163806209
num_examples: 274905
download_size: 944911106
dataset_size: 2110546394
- config_name: halh_mongolian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2968321741
num_examples: 3572365
- name: validation
num_bytes: 249388427
num_examples: 272872
- name: test
num_bytes: 274273975
num_examples: 274905
download_size: 1354713745
dataset_size: 3491984143
- config_name: hausa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1959088278
num_examples: 3608883
- name: validation
num_bytes: 164773493
num_examples: 279083
- name: test
num_bytes: 184494937
num_examples: 287084
download_size: 1002050510
dataset_size: 2308356708
- config_name: hebrew
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2396802100
num_examples: 3658066
- name: validation
num_bytes: 199963209
num_examples: 282157
- name: test
num_bytes: 220517866
num_examples: 283385
download_size: 1173201045
dataset_size: 2817283175
- config_name: hindi
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 5635800546
num_examples: 3772864
- name: validation
num_bytes: 366584523
num_examples: 283272
- name: test
num_bytes: 753622295
num_examples: 325548
download_size: 1940796804
dataset_size: 6756007364
- config_name: hungarian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1955970175
num_examples: 3637911
- name: validation
num_bytes: 164287856
num_examples: 280414
- name: test
num_bytes: 181236730
num_examples: 283954
download_size: 1118657007
dataset_size: 2301494761
- config_name: icelandic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1857557888
num_examples: 3572365
- name: validation
num_bytes: 155953512
num_examples: 272872
- name: test
num_bytes: 169989748
num_examples: 274905
download_size: 1215565930
dataset_size: 2183501148
- config_name: igbo
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2084831180
num_examples: 3597292
- name: validation
num_bytes: 172285334
num_examples: 277247
- name: test
num_bytes: 190702236
num_examples: 283449
download_size: 1028229109
dataset_size: 2447818750
- config_name: indonesian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1962831442
num_examples: 3610078
- name: validation
num_bytes: 163064972
num_examples: 276684
- name: test
num_bytes: 179566560
num_examples: 279875
download_size: 1007888568
dataset_size: 2305462974
- config_name: iranian_persian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3293040883
num_examples: 3785250
- name: validation
num_bytes: 267693067
num_examples: 289295
- name: test
num_bytes: 294289231
num_examples: 292695
download_size: 1564790357
dataset_size: 3855023181
- config_name: irish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2029806749
num_examples: 3573610
- name: validation
num_bytes: 170329030
num_examples: 272872
- name: test
num_bytes: 186316197
num_examples: 274905
download_size: 1113767898
dataset_size: 2386451976
- config_name: italian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2142342173
num_examples: 3890852
- name: validation
num_bytes: 184251381
num_examples: 311008
- name: test
num_bytes: 204453494
num_examples: 324702
download_size: 1207957366
dataset_size: 2531047048
- config_name: japanese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3513120381
num_examples: 6218459
- name: validation
num_bytes: 185953952
num_examples: 295333
- name: test
num_bytes: 207849832
num_examples: 305786
download_size: 1750470294
dataset_size: 3906924165
- config_name: javanese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1895566330
num_examples: 3573441
- name: validation
num_bytes: 156491096
num_examples: 272872
- name: test
num_bytes: 171647059
num_examples: 274905
download_size: 965841736
dataset_size: 2223704485
- config_name: kannada
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4601878209
num_examples: 3573855
- name: validation
num_bytes: 389144937
num_examples: 272872
- name: test
num_bytes: 433081749
num_examples: 274905
download_size: 1686041976
dataset_size: 5424104895
- config_name: kashmiri
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2956029543
num_examples: 3572365
- name: validation
num_bytes: 247155493
num_examples: 272872
- name: test
num_bytes: 272804294
num_examples: 274905
download_size: 1423960224
dataset_size: 3475989330
- config_name: kazakh
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2910190147
num_examples: 3572365
- name: validation
num_bytes: 242198704
num_examples: 272872
- name: test
num_bytes: 268312410
num_examples: 274905
download_size: 1339080618
dataset_size: 3420701261
- config_name: kinyarwanda
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 2303689
num_examples: 6859
- name: validation
num_bytes: 614384
num_examples: 1911
- name: test
num_bytes: 758055
num_examples: 2395
download_size: 1051641
dataset_size: 3676128
- config_name: korean
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2164270878
num_examples: 3605894
- name: validation
num_bytes: 182708679
num_examples: 276202
- name: test
num_bytes: 202554385
num_examples: 279418
download_size: 1147898768
dataset_size: 2549533942
- config_name: kyrgyz
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2953388369
num_examples: 3580987
- name: validation
num_bytes: 245339337
num_examples: 272872
- name: test
num_bytes: 270723246
num_examples: 274905
download_size: 1380773627
dataset_size: 3469450952
- config_name: lao
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3868618069
num_examples: 3572365
- name: validation
num_bytes: 324254376
num_examples: 272872
- name: test
num_bytes: 360931022
num_examples: 274905
download_size: 3595752162
dataset_size: 4553803467
- config_name: ligurian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 3159946
num_examples: 5955
- name: validation
num_bytes: 146833
num_examples: 217
- name: test
num_bytes: 173794
num_examples: 237
download_size: 1608513
dataset_size: 3480573
- config_name: lithuanian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1846675209
num_examples: 3573281
- name: validation
num_bytes: 155015338
num_examples: 272872
- name: test
num_bytes: 169208163
num_examples: 274905
download_size: 1056146665
dataset_size: 2170898710
- config_name: luxembourgish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2040321216
num_examples: 3572365
- name: validation
num_bytes: 170415841
num_examples: 272872
- name: test
num_bytes: 185691773
num_examples: 274905
download_size: 1109294633
dataset_size: 2396428830
- config_name: macedonian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3019539587
num_examples: 3572365
- name: validation
num_bytes: 253607831
num_examples: 272872
- name: test
num_bytes: 278963202
num_examples: 274905
download_size: 1381396890
dataset_size: 3552110620
- config_name: madurese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 336468
num_examples: 1000
- name: validation
num_bytes: 68004
num_examples: 200
- name: test
num_bytes: 269186
num_examples: 800
download_size: 238530
dataset_size: 673658
- config_name: malayalam
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4622727242
num_examples: 3577960
- name: validation
num_bytes: 381952641
num_examples: 273046
- name: test
num_bytes: 426486472
num_examples: 275232
download_size: 1719034789
dataset_size: 5431166355
- config_name: maltese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1993868744
num_examples: 3572365
- name: validation
num_bytes: 164474761
num_examples: 272872
- name: test
num_bytes: 180395631
num_examples: 274905
download_size: 1113361607
dataset_size: 2338739136
- config_name: manipuri
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4440413020
num_examples: 3572365
- name: validation
num_bytes: 379264818
num_examples: 272872
- name: test
num_bytes: 420006813
num_examples: 274905
download_size: 1625079083
dataset_size: 5239684651
- config_name: maori
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2033504713
num_examples: 3572365
- name: validation
num_bytes: 167628344
num_examples: 272872
- name: test
num_bytes: 183733568
num_examples: 274905
download_size: 996144209
dataset_size: 2384866625
- config_name: marathi
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4122741322
num_examples: 3579228
- name: validation
num_bytes: 342811505
num_examples: 272995
- name: test
num_bytes: 385723937
num_examples: 275142
download_size: 1598696436
dataset_size: 4851276764
- config_name: mesopotamian_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2577270729
num_examples: 3572365
- name: validation
num_bytes: 215365338
num_examples: 272872
- name: test
num_bytes: 238778008
num_examples: 274905
download_size: 1283329900
dataset_size: 3031414075
- config_name: minangkabau
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3844428273
num_examples: 5954148
- name: validation
num_bytes: 297124535
num_examples: 399598
- name: test
num_bytes: 337144517
num_examples: 401642
download_size: 1382456504
dataset_size: 4478697325
- config_name: moroccan_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2573747160
num_examples: 3591621
- name: validation
num_bytes: 215002390
num_examples: 273860
- name: test
num_bytes: 238263257
num_examples: 280827
download_size: 1245740016
dataset_size: 3027012807
- config_name: mozambican_portuguese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 2081708
num_examples: 6126
- name: validation
num_bytes: 525706
num_examples: 1534
- name: test
num_bytes: 2343090
num_examples: 7324
download_size: 1354082
dataset_size: 4950504
- config_name: najdi_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2445883805
num_examples: 3572501
- name: validation
num_bytes: 201423105
num_examples: 272872
- name: test
num_bytes: 223867052
num_examples: 274905
download_size: 1179337507
dataset_size: 2871173962
- config_name: nepali
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4006828125
num_examples: 3576367
- name: validation
num_bytes: 333796022
num_examples: 272872
- name: test
num_bytes: 373245075
num_examples: 274905
download_size: 1488954451
dataset_size: 4713869222
- config_name: ngaju
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 330693
num_examples: 1000
- name: validation
num_bytes: 67348
num_examples: 200
- name: test
num_bytes: 265722
num_examples: 800
download_size: 229728
dataset_size: 663763
- config_name: north_azerbaijani
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2006618778
num_examples: 3572365
- name: validation
num_bytes: 164786888
num_examples: 272872
- name: test
num_bytes: 181509957
num_examples: 274905
download_size: 1058557237
dataset_size: 2352915623
- config_name: north_levantine_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2396885807
num_examples: 3572365
- name: validation
num_bytes: 197809922
num_examples: 272872
- name: test
num_bytes: 219933368
num_examples: 274905
download_size: 1164623854
dataset_size: 2814629097
- config_name: northern_kurdish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1953648075
num_examples: 3572365
- name: validation
num_bytes: 163568866
num_examples: 272872
- name: test
num_bytes: 178862810
num_examples: 274905
download_size: 1053199711
dataset_size: 2296079751
- config_name: northern_sotho
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2126728358
num_examples: 3572506
- name: validation
num_bytes: 177710400
num_examples: 272872
- name: test
num_bytes: 194185170
num_examples: 274905
download_size: 1106886156
dataset_size: 2498623928
- config_name: northern_uzbek
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1919223589
num_examples: 3572365
- name: validation
num_bytes: 159059599
num_examples: 272872
- name: test
num_bytes: 174264291
num_examples: 274905
download_size: 1028630473
dataset_size: 2252547479
- config_name: norwegian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 33000285
num_examples: 59637
- name: validation
num_bytes: 3295687
num_examples: 6102
- name: test
num_bytes: 3548936
num_examples: 6613
download_size: 39236046
dataset_size: 39844908
- config_name: norwegian_bokmal
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1827550871
num_examples: 3572365
- name: validation
num_bytes: 149879088
num_examples: 272872
- name: test
num_bytes: 163549957
num_examples: 274905
download_size: 1011292704
dataset_size: 2140979916
- config_name: norwegian_nynorsk
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1744404224
num_examples: 3572365
- name: validation
num_bytes: 146137474
num_examples: 272872
- name: test
num_bytes: 158902110
num_examples: 274905
download_size: 992499567
dataset_size: 2049443808
- config_name: nyanja
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 516017
num_examples: 688
download_size: 275517
dataset_size: 516017
- config_name: panjabi
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 23815881
num_examples: 8541
download_size: 8978869
dataset_size: 23815881
- config_name: plateau_malagasy
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2139257120
num_examples: 3586962
- name: validation
num_bytes: 176626339
num_examples: 272872
- name: test
num_bytes: 193300637
num_examples: 274905
download_size: 1052260977
dataset_size: 2509184096
- config_name: polish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2067411091
num_examples: 3841451
- name: validation
num_bytes: 174849208
num_examples: 300161
- name: test
num_bytes: 197728084
num_examples: 312516
download_size: 1223143004
dataset_size: 2439988383
- config_name: portuguese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2046373181
num_examples: 3786062
- name: validation
num_bytes: 178599813
num_examples: 302603
- name: test
num_bytes: 197857567
num_examples: 312922
download_size: 1145224287
dataset_size: 2422830561
- config_name: romanian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1996007764
num_examples: 3602212
- name: validation
num_bytes: 166610246
num_examples: 275737
- name: test
num_bytes: 182639344
num_examples: 278552
download_size: 1117137359
dataset_size: 2345257354
- config_name: russian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3458190964
num_examples: 4005166
- name: validation
num_bytes: 301791957
num_examples: 322325
- name: test
num_bytes: 343829332
num_examples: 338994
download_size: 1715110629
dataset_size: 4103812253
- config_name: samoan
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2091850649
num_examples: 3572365
- name: validation
num_bytes: 173972380
num_examples: 272872
- name: test
num_bytes: 190476359
num_examples: 274905
download_size: 1040478771
dataset_size: 2456299388
- config_name: scottish_gaelic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2123886658
num_examples: 3572365
- name: validation
num_bytes: 177843868
num_examples: 272872
- name: test
num_bytes: 194208974
num_examples: 274905
download_size: 1119728162
dataset_size: 2495939500
- config_name: serbian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2917308714
num_examples: 3636573
- name: validation
num_bytes: 245864402
num_examples: 278819
- name: test
num_bytes: 269545380
num_examples: 282026
download_size: 1400029022
dataset_size: 3432718496
- config_name: shona
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1933195607
num_examples: 3576309
- name: validation
num_bytes: 159375213
num_examples: 273242
- name: test
num_bytes: 175700269
num_examples: 275643
download_size: 1046682613
dataset_size: 2268271089
- config_name: simplified_chinese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1580183501
num_examples: 3606935
- name: validation
num_bytes: 186290535
num_examples: 288870
- name: test
num_bytes: 168697225
num_examples: 281903
download_size: 998853646
dataset_size: 1935171261
- config_name: sindhi
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2701553602
num_examples: 3572639
- name: validation
num_bytes: 224680552
num_examples: 272872
- name: test
num_bytes: 249273956
num_examples: 274905
download_size: 1258283942
dataset_size: 3175508110
- config_name: sinhala
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3984796975
num_examples: 3587051
- name: validation
num_bytes: 326000751
num_examples: 272899
- name: test
num_bytes: 363112566
num_examples: 274911
download_size: 3220019406
dataset_size: 4673910292
- config_name: slovak
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1850051602
num_examples: 3594203
- name: validation
num_bytes: 154557657
num_examples: 275641
- name: test
num_bytes: 170226424
num_examples: 278143
download_size: 1097012176
dataset_size: 2174835683
- config_name: slovenian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1784602595
num_examples: 3593626
- name: validation
num_bytes: 149695968
num_examples: 275374
- name: test
num_bytes: 162563462
num_examples: 276873
download_size: 2380019444
dataset_size: 2096862025
- config_name: somali
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2027989680
num_examples: 3582111
- name: validation
num_bytes: 170198464
num_examples: 273168
- name: test
num_bytes: 187195768
num_examples: 275493
download_size: 1132793529
dataset_size: 2385383912
- config_name: south_azerbaijani
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2861316508
num_examples: 3572365
- name: validation
num_bytes: 237750578
num_examples: 272872
- name: test
num_bytes: 261490563
num_examples: 274905
download_size: 1341950228
dataset_size: 3360557649
- config_name: south_levantine_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2422505540
num_examples: 3572446
- name: validation
num_bytes: 200153231
num_examples: 272872
- name: test
num_bytes: 222482397
num_examples: 274905
download_size: 1183194893
dataset_size: 2845141168
- config_name: southern_pashto
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2825666617
num_examples: 3573354
- name: validation
num_bytes: 237517366
num_examples: 272872
- name: test
num_bytes: 263033910
num_examples: 274905
download_size: 1302995273
dataset_size: 3326217893
- config_name: southern_sotho
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2068850058
num_examples: 3572365
- name: validation
num_bytes: 171573895
num_examples: 272872
- name: test
num_bytes: 187999211
num_examples: 274905
download_size: 1074412885
dataset_size: 2428423164
- config_name: spanish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2161721655
num_examples: 3872864
- name: validation
num_bytes: 184471632
num_examples: 307443
- name: test
num_bytes: 205444273
num_examples: 322883
download_size: 1182596504
dataset_size: 2551637560
- config_name: standard_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4339045046
num_examples: 5857458
- name: validation
num_bytes: 331144957
num_examples: 388534
- name: test
num_bytes: 382897661
num_examples: 400032
download_size: 1580799168
dataset_size: 5053087664
- config_name: standard_latvian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1860391558
num_examples: 3572365
- name: validation
num_bytes: 155672443
num_examples: 272872
- name: test
num_bytes: 168394864
num_examples: 274905
download_size: 1061339876
dataset_size: 2184458865
- config_name: standard_malay
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1964002057
num_examples: 3593313
- name: validation
num_bytes: 162471171
num_examples: 274108
- name: test
num_bytes: 179528458
num_examples: 276744
download_size: 1000695579
dataset_size: 2306001686
- config_name: sundanese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1924405578
num_examples: 3573767
- name: validation
num_bytes: 159749483
num_examples: 273072
- name: test
num_bytes: 175461521
num_examples: 275705
download_size: 1010721074
dataset_size: 2259616582
- config_name: swahili
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1910618383
num_examples: 3580061
- name: validation
num_bytes: 160850754
num_examples: 275485
- name: test
num_bytes: 178506887
num_examples: 277688
download_size: 1021185290
dataset_size: 2249976024
- config_name: swedish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1843067837
num_examples: 3632622
- name: validation
num_bytes: 154563283
num_examples: 279291
- name: test
num_bytes: 172393013
num_examples: 286025
download_size: 1032105972
dataset_size: 2170024133
- config_name: taizzi_adeni_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2439237004
num_examples: 3572494
- name: validation
num_bytes: 202494517
num_examples: 272872
- name: test
num_bytes: 225118960
num_examples: 274905
download_size: 1185278137
dataset_size: 2866850481
- config_name: tajik
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3027849091
num_examples: 3572365
- name: validation
num_bytes: 254453315
num_examples: 272872
- name: test
num_bytes: 280691742
num_examples: 274905
download_size: 1597592403
dataset_size: 3562994148
- config_name: tamasheq
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1876056265
num_examples: 3572365
- name: validation
num_bytes: 157281898
num_examples: 272872
- name: test
num_bytes: 171652968
num_examples: 274905
download_size: 964274716
dataset_size: 2204991131
- config_name: tamil
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4846971429
num_examples: 3596707
- name: validation
num_bytes: 397406200
num_examples: 273472
- name: test
num_bytes: 443994594
num_examples: 275558
download_size: 1718959173
dataset_size: 5688372223
- config_name: telugu
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 5571519008
num_examples: 4058535
- name: validation
num_bytes: 362961076
num_examples: 272920
- name: test
num_bytes: 404861098
num_examples: 274947
download_size: 2082335866
dataset_size: 6339341182
- config_name: thai
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 5024401321
num_examples: 5338232
- name: validation
num_bytes: 459607575
num_examples: 452346
- name: test
num_bytes: 495094285
num_examples: 455468
download_size: 1979389165
dataset_size: 5979103181
- config_name: toba_batak
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 339934
num_examples: 1000
- name: validation
num_bytes: 68525
num_examples: 200
- name: test
num_bytes: 270791
num_examples: 800
download_size: 236860
dataset_size: 679250
- config_name: tosk_albanian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2082390116
num_examples: 3572485
- name: validation
num_bytes: 174685167
num_examples: 272872
- name: test
num_bytes: 191450773
num_examples: 274905
download_size: 1091437384
dataset_size: 2448526056
- config_name: traditional_chinese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1153322530
num_examples: 3574236
- name: validation
num_bytes: 97233449
num_examples: 272872
- name: test
num_bytes: 108005266
num_examples: 274905
download_size: 647326893
dataset_size: 1358561245
- config_name: tunisian_arabic
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2477511602
num_examples: 3572365
- name: validation
num_bytes: 205639123
num_examples: 272872
- name: test
num_bytes: 226738016
num_examples: 274905
download_size: 1231260895
dataset_size: 2909888741
- config_name: turkish
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1919543256
num_examples: 3628109
- name: validation
num_bytes: 157731647
num_examples: 276667
- name: test
num_bytes: 173356148
num_examples: 279344
download_size: 1045667618
dataset_size: 2250631051
- config_name: twi
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 2003442
num_examples: 7320
- name: validation
num_bytes: 278167
num_examples: 1142
- name: test
num_bytes: 599853
num_examples: 2378
download_size: 586358
dataset_size: 2881462
- config_name: ukrainian
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3085029543
num_examples: 3729748
- name: validation
num_bytes: 260927426
num_examples: 288316
- name: test
num_bytes: 285989353
num_examples: 291984
download_size: 1515599383
dataset_size: 3631946322
- config_name: urdu
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 3690093592
num_examples: 3876197
- name: validation
num_bytes: 241362791
num_examples: 273872
- name: test
num_bytes: 357394756
num_examples: 308466
download_size: 1684758608
dataset_size: 4288851139
- config_name: vietnamese
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2340454874
num_examples: 3613270
- name: validation
num_bytes: 194259346
num_examples: 278354
- name: test
num_bytes: 213225524
num_examples: 279426
download_size: 1158012464
dataset_size: 2747939744
- config_name: welsh
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1876402572
num_examples: 3572365
- name: validation
num_bytes: 156663733
num_examples: 272872
- name: test
num_bytes: 171072229
num_examples: 274905
download_size: 1037154717
dataset_size: 2204138534
- config_name: wolof
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 855747
num_examples: 3146
- name: validation
num_bytes: 34846
num_examples: 240
- name: test
num_bytes: 43502
num_examples: 313
download_size: 382706
dataset_size: 934095
- config_name: xhosa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1976828692
num_examples: 3574806
- name: validation
num_bytes: 164740432
num_examples: 273166
- name: test
num_bytes: 181513204
num_examples: 275499
download_size: 1084449799
dataset_size: 2323082328
- config_name: yoruba
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2452849257
num_examples: 3587233
- name: validation
num_bytes: 199786101
num_examples: 273527
- name: test
num_bytes: 219980275
num_examples: 276047
download_size: 1205442734
dataset_size: 2872615633
- config_name: zulu
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1939474626
num_examples: 3574437
- name: validation
num_bytes: 160437521
num_examples: 273107
- name: test
num_bytes: 176290083
num_examples: 275217
download_size: 1075604507
dataset_size: 2276202230
configs:
- config_name: achinese
data_files:
- split: train
path: achinese/train-*
- split: validation
path: achinese/validation-*
- split: test
path: achinese/test-*
- config_name: afrikaans
data_files:
- split: train
path: afrikaans/train-*
- split: validation
path: afrikaans/validation-*
- split: test
path: afrikaans/test-*
- config_name: algerian_arabic
data_files:
- split: validation
path: algerian_arabic/validation-*
- split: test
path: algerian_arabic/test-*
- split: train
path: algerian_arabic/train-*
- config_name: amharic
data_files:
- split: train
path: amharic/train-*
- split: validation
path: amharic/validation-*
- split: test
path: amharic/test-*
- config_name: armenian
data_files:
- split: train
path: armenian/train-*
- split: validation
path: armenian/validation-*
- split: test
path: armenian/test-*
- config_name: balinese
data_files:
- split: validation
path: balinese/validation-*
- split: train
path: balinese/train-*
- split: test
path: balinese/test-*
- config_name: banjar
data_files:
- split: train
path: banjar/train-*
- split: validation
path: banjar/validation-*
- split: test
path: banjar/test-*
- config_name: basque
data_files:
- split: train
path: basque/train-*
- split: validation
path: basque/validation-*
- split: test
path: basque/test-*
- config_name: belarusian
data_files:
- split: train
path: belarusian/train-*
- split: validation
path: belarusian/validation-*
- split: test
path: belarusian/test-*
- config_name: bemba
data_files:
- split: train
path: bemba/train-*
- split: validation
path: bemba/validation-*
- split: test
path: bemba/test-*
- config_name: bengali
data_files:
- split: train
path: bengali/train-*
- split: validation
path: bengali/validation-*
- split: test
path: bengali/test-*
- config_name: bulgarian
data_files:
- split: train
path: bulgarian/train-*
- split: validation
path: bulgarian/validation-*
- split: test
path: bulgarian/test-*
- config_name: burmese
data_files:
- split: train
path: burmese/train-*
- split: validation
path: burmese/validation-*
- split: test
path: burmese/test-*
- config_name: cantonese
data_files:
- split: train
path: cantonese/train-*
- split: validation
path: cantonese/validation-*
- split: test
path: cantonese/test-*
- config_name: catalan
data_files:
- split: train
path: catalan/train-*
- split: validation
path: catalan/validation-*
- split: test
path: catalan/test-*
- config_name: cebuano
data_files:
- split: train
path: cebuano/train-*
- split: validation
path: cebuano/validation-*
- split: test
path: cebuano/test-*
- config_name: central_kanuri
data_files:
- split: train
path: central_kanuri/train-*
- split: validation
path: central_kanuri/validation-*
- split: test
path: central_kanuri/test-*
- config_name: central_khmer
data_files:
- split: train
path: central_khmer/train-*
- split: validation
path: central_khmer/validation-*
- split: test
path: central_khmer/test-*
- config_name: central_kurdish
data_files:
- split: train
path: central_kurdish/train-*
- split: validation
path: central_kurdish/validation-*
- split: test
path: central_kurdish/test-*
- config_name: chinese
data_files:
- split: train
path: chinese/train-*
- split: validation
path: chinese/validation-*
- split: test
path: chinese/test-*
- config_name: croatian
data_files:
- split: train
path: croatian/train-*
- split: validation
path: croatian/validation-*
- split: test
path: croatian/test-*
- config_name: czech
data_files:
- split: train
path: czech/train-*
- split: validation
path: czech/validation-*
- split: test
path: czech/test-*
- config_name: danish
data_files:
- split: train
path: danish/train-*
- split: validation
path: danish/validation-*
- split: test
path: danish/test-*
- config_name: dutch
data_files:
- split: train
path: dutch/train-*
- split: validation
path: dutch/validation-*
- split: test
path: dutch/test-*
- config_name: eastern_yiddish
data_files:
- split: train
path: eastern_yiddish/train-*
- split: validation
path: eastern_yiddish/validation-*
- split: test
path: eastern_yiddish/test-*
- config_name: egyptian_arabic
data_files:
- split: train
path: egyptian_arabic/train-*
- split: validation
path: egyptian_arabic/validation-*
- split: test
path: egyptian_arabic/test-*
- config_name: english
data_files:
- split: validation
path: english/validation-*
- split: test
path: english/test-*
- split: train
path: english/train-*
- config_name: esperanto
data_files:
- split: train
path: esperanto/train-*
- split: validation
path: esperanto/validation-*
- split: test
path: esperanto/test-*
- config_name: estonian
data_files:
- split: train
path: estonian/train-*
- split: validation
path: estonian/validation-*
- split: test
path: estonian/test-*
- config_name: filipino
data_files:
- split: train
path: filipino/train-*
- split: test
path: filipino/test-*
- config_name: finnish
data_files:
- split: train
path: finnish/train-*
- split: validation
path: finnish/validation-*
- split: test
path: finnish/test-*
- config_name: fon
data_files:
- split: train
path: fon/train-*
- split: validation
path: fon/validation-*
- split: test
path: fon/test-*
- config_name: french
data_files:
- split: train
path: french/train-*
- split: validation
path: french/validation-*
- split: test
path: french/test-*
- config_name: galician
data_files:
- split: train
path: galician/train-*
- split: validation
path: galician/validation-*
- split: test
path: galician/test-*
- config_name: georgian
data_files:
- split: train
path: georgian/train-*
- split: validation
path: georgian/validation-*
- split: test
path: georgian/test-*
- config_name: german
data_files:
- split: train
path: german/train-*
- split: validation
path: german/validation-*
- split: test
path: german/test-*
- config_name: greek
data_files:
- split: train
path: greek/train-*
- split: validation
path: greek/validation-*
- split: test
path: greek/test-*
- config_name: gujarati
data_files:
- split: train
path: gujarati/train-*
- split: validation
path: gujarati/validation-*
- split: test
path: gujarati/test-*
- config_name: haitian
data_files:
- split: train
path: haitian/train-*
- split: validation
path: haitian/validation-*
- split: test
path: haitian/test-*
- config_name: halh_mongolian
data_files:
- split: train
path: halh_mongolian/train-*
- split: validation
path: halh_mongolian/validation-*
- split: test
path: halh_mongolian/test-*
- config_name: hausa
data_files:
- split: train
path: hausa/train-*
- split: validation
path: hausa/validation-*
- split: test
path: hausa/test-*
- config_name: hebrew
data_files:
- split: train
path: hebrew/train-*
- split: validation
path: hebrew/validation-*
- split: test
path: hebrew/test-*
- config_name: hindi
data_files:
- split: train
path: hindi/train-*
- split: validation
path: hindi/validation-*
- split: test
path: hindi/test-*
- config_name: hungarian
data_files:
- split: train
path: hungarian/train-*
- split: validation
path: hungarian/validation-*
- split: test
path: hungarian/test-*
- config_name: icelandic
data_files:
- split: validation
path: icelandic/validation-*
- split: test
path: icelandic/test-*
- split: train
path: icelandic/train-*
- config_name: igbo
data_files:
- split: train
path: igbo/train-*
- split: validation
path: igbo/validation-*
- split: test
path: igbo/test-*
- config_name: indonesian
data_files:
- split: train
path: indonesian/train-*
- split: validation
path: indonesian/validation-*
- split: test
path: indonesian/test-*
- config_name: iranian_persian
data_files:
- split: train
path: iranian_persian/train-*
- split: validation
path: iranian_persian/validation-*
- split: test
path: iranian_persian/test-*
- config_name: irish
data_files:
- split: train
path: irish/train-*
- split: validation
path: irish/validation-*
- split: test
path: irish/test-*
- config_name: italian
data_files:
- split: train
path: italian/train-*
- split: validation
path: italian/validation-*
- split: test
path: italian/test-*
- config_name: japanese
data_files:
- split: train
path: japanese/train-*
- split: validation
path: japanese/validation-*
- split: test
path: japanese/test-*
- config_name: javanese
data_files:
- split: train
path: javanese/train-*
- split: validation
path: javanese/validation-*
- split: test
path: javanese/test-*
- config_name: kannada
data_files:
- split: train
path: kannada/train-*
- split: validation
path: kannada/validation-*
- split: test
path: kannada/test-*
- config_name: kashmiri
data_files:
- split: train
path: kashmiri/train-*
- split: validation
path: kashmiri/validation-*
- split: test
path: kashmiri/test-*
- config_name: kazakh
data_files:
- split: train
path: kazakh/train-*
- split: validation
path: kazakh/validation-*
- split: test
path: kazakh/test-*
- config_name: kinyarwanda
data_files:
- split: train
path: kinyarwanda/train-*
- split: validation
path: kinyarwanda/validation-*
- split: test
path: kinyarwanda/test-*
- config_name: korean
data_files:
- split: train
path: korean/train-*
- split: validation
path: korean/validation-*
- split: test
path: korean/test-*
- config_name: kyrgyz
data_files:
- split: train
path: kyrgyz/train-*
- split: validation
path: kyrgyz/validation-*
- split: test
path: kyrgyz/test-*
- config_name: lao
data_files:
- split: validation
path: lao/validation-*
- split: test
path: lao/test-*
- split: train
path: lao/train-*
- config_name: ligurian
data_files:
- split: train
path: ligurian/train-*
- split: validation
path: ligurian/validation-*
- split: test
path: ligurian/test-*
- config_name: lithuanian
data_files:
- split: train
path: lithuanian/train-*
- split: validation
path: lithuanian/validation-*
- split: test
path: lithuanian/test-*
- config_name: luxembourgish
data_files:
- split: train
path: luxembourgish/train-*
- split: validation
path: luxembourgish/validation-*
- split: test
path: luxembourgish/test-*
- config_name: macedonian
data_files:
- split: train
path: macedonian/train-*
- split: validation
path: macedonian/validation-*
- split: test
path: macedonian/test-*
- config_name: madurese
data_files:
- split: train
path: madurese/train-*
- split: validation
path: madurese/validation-*
- split: test
path: madurese/test-*
- config_name: malayalam
data_files:
- split: train
path: malayalam/train-*
- split: validation
path: malayalam/validation-*
- split: test
path: malayalam/test-*
- config_name: maltese
data_files:
- split: train
path: maltese/train-*
- split: validation
path: maltese/validation-*
- split: test
path: maltese/test-*
- config_name: manipuri
data_files:
- split: train
path: manipuri/train-*
- split: validation
path: manipuri/validation-*
- split: test
path: manipuri/test-*
- config_name: maori
data_files:
- split: train
path: maori/train-*
- split: validation
path: maori/validation-*
- split: test
path: maori/test-*
- config_name: marathi
data_files:
- split: train
path: marathi/train-*
- split: validation
path: marathi/validation-*
- split: test
path: marathi/test-*
- config_name: mesopotamian_arabic
data_files:
- split: train
path: mesopotamian_arabic/train-*
- split: validation
path: mesopotamian_arabic/validation-*
- split: test
path: mesopotamian_arabic/test-*
- config_name: minangkabau
data_files:
- split: train
path: minangkabau/train-*
- split: validation
path: minangkabau/validation-*
- split: test
path: minangkabau/test-*
- config_name: moroccan_arabic
data_files:
- split: train
path: moroccan_arabic/train-*
- split: validation
path: moroccan_arabic/validation-*
- split: test
path: moroccan_arabic/test-*
- config_name: mozambican_portuguese
data_files:
- split: train
path: mozambican_portuguese/train-*
- split: validation
path: mozambican_portuguese/validation-*
- split: test
path: mozambican_portuguese/test-*
- config_name: najdi_arabic
data_files:
- split: train
path: najdi_arabic/train-*
- split: validation
path: najdi_arabic/validation-*
- split: test
path: najdi_arabic/test-*
- config_name: nepali
data_files:
- split: train
path: nepali/train-*
- split: validation
path: nepali/validation-*
- split: test
path: nepali/test-*
- config_name: ngaju
data_files:
- split: train
path: ngaju/train-*
- split: validation
path: ngaju/validation-*
- split: test
path: ngaju/test-*
- config_name: north_azerbaijani
data_files:
- split: train
path: north_azerbaijani/train-*
- split: validation
path: north_azerbaijani/validation-*
- split: test
path: north_azerbaijani/test-*
- config_name: north_levantine_arabic
data_files:
- split: train
path: north_levantine_arabic/train-*
- split: validation
path: north_levantine_arabic/validation-*
- split: test
path: north_levantine_arabic/test-*
- config_name: northern_kurdish
data_files:
- split: train
path: northern_kurdish/train-*
- split: validation
path: northern_kurdish/validation-*
- split: test
path: northern_kurdish/test-*
- config_name: northern_sotho
data_files:
- split: train
path: northern_sotho/train-*
- split: validation
path: northern_sotho/validation-*
- split: test
path: northern_sotho/test-*
- config_name: northern_uzbek
data_files:
- split: train
path: northern_uzbek/train-*
- split: validation
path: northern_uzbek/validation-*
- split: test
path: northern_uzbek/test-*
- config_name: norwegian
data_files:
- split: train
path: norwegian/train-*
- split: validation
path: norwegian/validation-*
- split: test
path: norwegian/test-*
- config_name: norwegian_bokmal
data_files:
- split: train
path: norwegian_bokmal/train-*
- split: validation
path: norwegian_bokmal/validation-*
- split: test
path: norwegian_bokmal/test-*
- config_name: norwegian_nynorsk
data_files:
- split: train
path: norwegian_nynorsk/train-*
- split: validation
path: norwegian_nynorsk/validation-*
- split: test
path: norwegian_nynorsk/test-*
- config_name: nyanja
data_files:
- split: train
path: nyanja/train-*
- config_name: panjabi
data_files:
- split: train
path: panjabi/train-*
- config_name: plateau_malagasy
data_files:
- split: train
path: plateau_malagasy/train-*
- split: validation
path: plateau_malagasy/validation-*
- split: test
path: plateau_malagasy/test-*
- config_name: polish
data_files:
- split: train
path: polish/train-*
- split: validation
path: polish/validation-*
- split: test
path: polish/test-*
- config_name: portuguese
data_files:
- split: train
path: portuguese/train-*
- split: validation
path: portuguese/validation-*
- split: test
path: portuguese/test-*
- config_name: romanian
data_files:
- split: train
path: romanian/train-*
- split: validation
path: romanian/validation-*
- split: test
path: romanian/test-*
- config_name: russian
data_files:
- split: train
path: russian/train-*
- split: validation
path: russian/validation-*
- split: test
path: russian/test-*
- config_name: samoan
data_files:
- split: train
path: samoan/train-*
- split: validation
path: samoan/validation-*
- split: test
path: samoan/test-*
- config_name: scottish_gaelic
data_files:
- split: train
path: scottish_gaelic/train-*
- split: validation
path: scottish_gaelic/validation-*
- split: test
path: scottish_gaelic/test-*
- config_name: serbian
data_files:
- split: train
path: serbian/train-*
- split: validation
path: serbian/validation-*
- split: test
path: serbian/test-*
- config_name: shona
data_files:
- split: train
path: shona/train-*
- split: validation
path: shona/validation-*
- split: test
path: shona/test-*
- config_name: simplified_chinese
data_files:
- split: train
path: simplified_chinese/train-*
- split: validation
path: simplified_chinese/validation-*
- split: test
path: simplified_chinese/test-*
- config_name: sindhi
data_files:
- split: train
path: sindhi/train-*
- split: validation
path: sindhi/validation-*
- split: test
path: sindhi/test-*
- config_name: sinhala
data_files:
- split: train
path: sinhala/train-*
- split: validation
path: sinhala/validation-*
- split: test
path: sinhala/test-*
- config_name: slovak
data_files:
- split: train
path: slovak/train-*
- split: validation
path: slovak/validation-*
- split: test
path: slovak/test-*
- config_name: slovenian
data_files:
- split: validation
path: slovenian/validation-*
- split: test
path: slovenian/test-*
- split: train
path: slovenian/train-*
- config_name: somali
data_files:
- split: train
path: somali/train-*
- split: validation
path: somali/validation-*
- split: test
path: somali/test-*
- config_name: south_azerbaijani
data_files:
- split: train
path: south_azerbaijani/train-*
- split: validation
path: south_azerbaijani/validation-*
- split: test
path: south_azerbaijani/test-*
- config_name: south_levantine_arabic
data_files:
- split: train
path: south_levantine_arabic/train-*
- split: validation
path: south_levantine_arabic/validation-*
- split: test
path: south_levantine_arabic/test-*
- config_name: southern_pashto
data_files:
- split: train
path: southern_pashto/train-*
- split: validation
path: southern_pashto/validation-*
- split: test
path: southern_pashto/test-*
- config_name: southern_sotho
data_files:
- split: train
path: southern_sotho/train-*
- split: validation
path: southern_sotho/validation-*
- split: test
path: southern_sotho/test-*
- config_name: spanish
data_files:
- split: train
path: spanish/train-*
- split: validation
path: spanish/validation-*
- split: test
path: spanish/test-*
- config_name: standard_arabic
data_files:
- split: train
path: standard_arabic/train-*
- split: validation
path: standard_arabic/validation-*
- split: test
path: standard_arabic/test-*
- config_name: standard_latvian
data_files:
- split: train
path: standard_latvian/train-*
- split: validation
path: standard_latvian/validation-*
- split: test
path: standard_latvian/test-*
- config_name: standard_malay
data_files:
- split: train
path: standard_malay/train-*
- split: validation
path: standard_malay/validation-*
- split: test
path: standard_malay/test-*
- config_name: sundanese
data_files:
- split: train
path: sundanese/train-*
- split: validation
path: sundanese/validation-*
- split: test
path: sundanese/test-*
- config_name: swahili
data_files:
- split: train
path: swahili/train-*
- split: validation
path: swahili/validation-*
- split: test
path: swahili/test-*
- config_name: swedish
data_files:
- split: train
path: swedish/train-*
- split: validation
path: swedish/validation-*
- split: test
path: swedish/test-*
- config_name: taizzi_adeni_arabic
data_files:
- split: train
path: taizzi_adeni_arabic/train-*
- split: validation
path: taizzi_adeni_arabic/validation-*
- split: test
path: taizzi_adeni_arabic/test-*
- config_name: tajik
data_files:
- split: validation
path: tajik/validation-*
- split: test
path: tajik/test-*
- split: train
path: tajik/train-*
- config_name: tamasheq
data_files:
- split: train
path: tamasheq/train-*
- split: validation
path: tamasheq/validation-*
- split: test
path: tamasheq/test-*
- config_name: tamil
data_files:
- split: train
path: tamil/train-*
- split: validation
path: tamil/validation-*
- split: test
path: tamil/test-*
- config_name: telugu
data_files:
- split: train
path: telugu/train-*
- split: validation
path: telugu/validation-*
- split: test
path: telugu/test-*
- config_name: thai
data_files:
- split: train
path: thai/train-*
- split: validation
path: thai/validation-*
- split: test
path: thai/test-*
- config_name: toba_batak
data_files:
- split: train
path: toba_batak/train-*
- split: validation
path: toba_batak/validation-*
- split: test
path: toba_batak/test-*
- config_name: tosk_albanian
data_files:
- split: train
path: tosk_albanian/train-*
- split: validation
path: tosk_albanian/validation-*
- split: test
path: tosk_albanian/test-*
- config_name: traditional_chinese
data_files:
- split: train
path: traditional_chinese/train-*
- split: validation
path: traditional_chinese/validation-*
- split: test
path: traditional_chinese/test-*
- config_name: tunisian_arabic
data_files:
- split: train
path: tunisian_arabic/train-*
- split: validation
path: tunisian_arabic/validation-*
- split: test
path: tunisian_arabic/test-*
- config_name: turkish
data_files:
- split: train
path: turkish/train-*
- split: validation
path: turkish/validation-*
- split: test
path: turkish/test-*
- config_name: twi
data_files:
- split: train
path: twi/train-*
- split: validation
path: twi/validation-*
- split: test
path: twi/test-*
- config_name: ukrainian
data_files:
- split: train
path: ukrainian/train-*
- split: validation
path: ukrainian/validation-*
- split: test
path: ukrainian/test-*
- config_name: urdu
data_files:
- split: train
path: urdu/train-*
- split: validation
path: urdu/validation-*
- split: test
path: urdu/test-*
- config_name: vietnamese
data_files:
- split: train
path: vietnamese/train-*
- split: validation
path: vietnamese/validation-*
- split: test
path: vietnamese/test-*
- config_name: welsh
data_files:
- split: train
path: welsh/train-*
- split: validation
path: welsh/validation-*
- split: test
path: welsh/test-*
- config_name: wolof
data_files:
- split: train
path: wolof/train-*
- split: validation
path: wolof/validation-*
- split: test
path: wolof/test-*
- config_name: xhosa
data_files:
- split: train
path: xhosa/train-*
- split: validation
path: xhosa/validation-*
- split: test
path: xhosa/test-*
- config_name: yoruba
data_files:
- split: train
path: yoruba/train-*
- split: validation
path: yoruba/validation-*
- split: test
path: yoruba/test-*
- config_name: zulu
data_files:
- split: train
path: zulu/train-*
- split: validation
path: zulu/validation-*
- split: test
path: zulu/test-*
---

****This is a re-upload of the [aya_collection](https://huggingface.co/datasets/CohereForAI/aya_collection), and only differs in the structure of upload. While the original [aya_collection](https://huggingface.co/datasets/CohereForAI/aya_collection) is structured by folders split according to dataset name, this dataset is split by language. We recommend you use this version of the dataset if you are only interested in downloading all of the Aya collection for a single or smaller set of languages.****
# Dataset Summary
The Aya Collection is a massive multilingual collection consisting of 513 million instances of prompts and completions covering a wide range of tasks.
This collection incorporates instruction-style templates from fluent speakers and applies them to a curated list of datasets, as well as translations of instruction-style datasets into 101 languages. Aya Dataset, a human-curated multilingual instruction and response dataset, is also part of this collection. See our paper for more details regarding the collection.
- **Curated by:** Contributors of [Aya Open Science Intiative](https://cohere.com/research/aya)
- **Language(s):** 115 languages
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)
- **Aya Datasets Family:**
| Name | Explanation |
|------|--------------|
| [aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) | Human-annotated multilingual instruction finetuning dataset, comprising over 204K instances across 65 languages. |
| [aya_collection](https://huggingface.co/datasets/CohereForAI/aya_collection) | Created by applying instruction-style templates from fluent speakers to 44 datasets, including translations of 19 instruction-style datasets into 101 languages. This collection structured based on dataset level subsets. An alternative version of the collection structured by language subsets is also available.|
| [aya_collection_language_split](https://huggingface.co/datasets/CohereForAI/aya_collection_language_split) | Aya Collection structured based on language level subsets. |
| [aya_evaluation_suite](https://huggingface.co/datasets/CohereForAI/aya_evaluation_suite) | A diverse evaluation set for multilingual open-ended generation, featuring 250 culturally grounded prompts in 7 languages, 200 translated prompts in 24 languages, and human-edited versions selected for cross-cultural relevance from English Dolly in 6 languages.|
| [aya_redteaming](https://huggingface.co/datasets/CohereForAI/aya_redteaming)| A red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm with explicit labels for "global" and "local" harm.|
# Dataset
The `Aya Collection` is a comprehensive, large corpus of datasets that can be used by researchers around the world to train multilingual models. Our goal is only to include datasets with permissive licensing for manipulation and redistribution.
The `Aya Collection` consists of three different sources of data:
1. Templated data: We collaborated with fluent speakers to create templates that allowed for the automatic expansion of existing datasets into various languages.
2. Translated data: We translated a hand-selected subset of 19 datasets into 101 languages (114 dialects) using the NLLB 3.3B parameter machine translation model.
3. Aya Dataset: We release the [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) as a subset of the overall collection. This is the only dataset in the collection that is human-annotated in its entirety.
## Load with Datasets
To load this dataset with Datasets, you'll need to install Datasets as `pip install datasets --upgrade` and then use the following code:
```python
from datasets import load_dataset
dataset = load_dataset("CohereForAI/aya_collection_language_split", "english")
```
In the above code snippet, "english" refers to a subset of the aya_collection. You can load other subsets by specifying its name at the time of loading the dataset.
## Data Instances
An example of a `train` instance looks as follows:
```json
{'id': 246001,
'inputs': 'The following query in English is taken from the geography category. What could be the answer to the question?\nWhat is the seventh tallest mountain in North America?',
'targets': 'The answer is Mount Lucania.',
'dataset_name': 'Mintaka-inst',
'sub_dataset_name': '-',
'task_type': 'question-answering',
'template_id': 3,
'language': 'eng',
'split': 'train',
'script': 'Latn'
}
```
## Data Fields
The data fields are the same among all splits:
- `id:` Unique id of the data point
- `inputs:` Prompt or input to the language model.
- `targets:` Completion or output of the language model.
- `dataset_name:` The name of the source dataset that the data point was taken from
- `sub_dataset_name:` If the source is a collection, this field indicates which part of that collection the data point was taken from. If it is not a collection, this field is left blank.
- `task_type:` The task type that this conversation belongs to.
- `template_id`: The id of the template applied to this data point.
- `language:` The ISO code of the dialect of the conversation.
- `script:` The script of the language.
- `split:` Indicates whether the data point is part of the `train` or the `test` split.
### Statistics
The total number of data points, including the Aya Dataset` is 513,758,189. To view the breakdown of dialect codes and the respective templated and translated data point counts in the Aya Collection , refer to the toggled table below.
<details>
<summary> <b> Breakdown of Aya Collection data point counts grouped by dialects </b> </summary>
|dialect code|language|total count |
|------------|--------|---------------|
|ace |Achinese|8242684 |
|acm |Arabic |4120342 |
|acq |Arabic |4120342 |
|aeb |Arabic |4120342 |
|afr |Afrikaans|4126450 |
|ajp |Arabic |4120342 |
|als |Albanian|4120342 |
|amh |Amharic |4145669 |
|apc |Arabic |4120342 |
|arb |Arabic |6641429 |
|ars |Arabic |4120342 |
|ary |Arabic |4138418 |
|arz |Arabic |4120342 |
|azb |Azerbaijani|4120342 |
|azj |Azerbaijani|4120342 |
|bel |Belarusian|4141615 |
|ben |Bengali |4151003 |
|bjn |Banjar |8242684 |
|bul |Bulgarian|4158064 |
|cat |Catalan |4187242 |
|ceb |Cebuano |4120342 |
|ces |Czech |4299946 |
|ckb |Kurdish |4120342 |
|cym |Welsh |4120342 |
|dan |Danish |4156652 |
|deu |German |5447064 |
|ell |Greek |4160633 |
|eng |English |17838105 |
|epo |Esperanto|4120342 |
|est |Estonian|4120342 |
|eus |Basque |4120342 |
|fin |Finnish |4578237 |
|fra |French |4955862 |
|gla |Scottish Gaelic|4120342 |
|gle |Irish |4120342 |
|glg |Galician|4120342 |
|guj |Gujarati|4122499 |
|hat |Haitian Creole|4120342 |
|hau |Hausa |4171738 |
|heb |Hebrew |4223808 |
|hin |Hindi |4380729 |
|hun |Hungarian|4202381 |
|hye |Armenian|4127422 |
|ibo |Igbo |4156654 |
|ind |Indonesian|4166051 |
|isl |Icelandic|4120342 |
|ita |Italian |4526024 |
|jav |Javanese|4121171 |
|jpn |Japanese|6813519 |
|kan |Kannada |4121498 |
|kas |Kashmiri|4120342 |
|kat |Georgian|4120342 |
|kaz |Kazakh |4120342 |
|khk |Mongolian|4120342 |
|khm |Khmer |4120342 |
|kir |Kyrgyz |4120342 |
|kmr |Kurdish |4120342 |
|knc |Kanuri |8240684 |
|kor |Korean |4161353 |
|lao |Lao |4120342 |
|lit |Lithuanian|4120342 |
|ltz |Luxembourgish|4120342 |
|lvs |Latvian |4120342 |
|mal |Malayalam|4124689 |
|mar |Marathi |4124020 |
|min |Minangkabau|6755788 |
|mkd |Macedonian|4120342 |
|mlt |Maltese |4120342 |
|mni |Manipuri|4120342 |
|mri |Maori |4120342 |
|mya |Burmese |4120342 |
|nld |Dutch |4340523 |
|nno |Norwegian|4120342 |
|nob |Norwegian|4120342 |
|npi |Nepali |4120342 |
|nso |Northern Sotho|4120342 |
|pbt |Pashto |4120342 |
|pes |Persian |4365862 |
|plt |Malagasy|4120342 |
|pol |Polish |4452845 |
|por |Portuguese|4407774 |
|ron |Romanian|4156701 |
|rus |Russian |4666262 |
|sin |Sinhala |4120537 |
|slk |Slovak |4148187 |
|slv |Slovenian|4146073 |
|smo |Samoan |4120342 |
|sna |Shona |4124026 |
|snd |Sindhi |4120342 |
|som |Somali |4123268 |
|sot |Southern Sotho|4120342 |
|spa |Spanish |4499536 |
|srp |Serbian |4197466 |
|sun |Sundanese|4122550 |
|swe |Swedish |4196828 |
|swh |Swahili |4133068 |
|tam |Tamil |4131804 |
|taq |Tamasheq|4120342 |
|tel |Telugu |4598163 |
|tgk |Tajik |4120342 |
|tha |Thai |6245522 |
|tur |Turkish |4180274 |
|ukr |Ukrainian|4309726 |
|urd |Urdu |4458081 |
|uzn |Uzbek |4120342 |
|vie |Vietnamese|4162574 |
|xho |Xhosa |4123294 |
|ydd |Yiddish |4120342 |
|yor |Yoruba |4125249 |
|yue |Chinese |4120342 |
|zho-Hans |Chinese |4174870 |
|zho-Hant |Chinese |4120342 |
|zsm |Malay |4134292 |
|zul |Zulu |4121128 |
|arq |Arabic |6046 |
|ban |Balinese|2000 |
|bbc |Toba Batak|2000 |
|bem |Bemba |776 |
|fil |Filipino|220 |
|fon |Fon |845 |
|hrv |Croatian|9007 |
|kin |Kinyarwanda|11165 |
|lij |Ligurian|6409 |
|mad |Madurese|2000 |
|nij |Ngaju |2000 |
|nor |Norwegian|72352 |
|pan |Punjabi |2156 |
|twi |Twi |10840 |
|wol |Wolof |785 |
|zho |Chinese |74972 |
PS: Templated data also includes Mozambican Portuguese, which doesn't have its own ISO language code.
</details>
<br>
# Motivations & Intentions
- **Curation Rationale:** Automatic augmentation of existing datasets serves to enhance the available linguistic resources for multiple languages. The list of languages was initially established from mT5 and aligned with the annotators’ language list and NLLB translation model. The datasets were translated directly from English for all languages.
# Additional Information
## Provenance
- **Methods Used:** A combination of crowd-sourced templating and automatic translation was employed to source this dataset.
- **Methodology Details:**
- *Source:* Existing NLP datasets
- *Dates of Collection:* May 2023 - Dec 2023
## Dataset Version and Maintenance
- **Maintenance Status:** Actively Maintained
- **Version Details:**
- *Current version:* 1.0
- *Last Update:* 02/2024
- *First Release:* 02/2024
## Authorship
- **Publishing Organization:** [Cohere For AI](https://cohere.com/research)
- **Industry Type:** Not-for-profit - Tech
- **Contact Details:** https://cohere.com/research/aya
## Licensing Information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.
## Citation Information
```bibtex
@misc{singh2024aya,
title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning},
author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker},
year={2024},
eprint={2402.06619},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
提供机构:
CohereForAI
原始信息汇总
数据集概述
数据集语言支持
本数据集支持多种语言,包括但不限于:
- ace
- afr
- amh
- ara
- aze
- ban
- bbc
- bel
- bem
- ben
- bjn
- bul
- cat
- ceb
- ces
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fil
- fin
- fon
- fra
- gla
- gle
- glg
- guj
- hat
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kan
- kas
- kat
- kau
- kaz
- khm
- kin
- kir
- kor
- kur
- lao
- lav
- lij
- lit
- ltz
- mad
- mal
- man
- mar
- min
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nij
- nld
- nor
- nso
- nya
- pan
- pes
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- srp
- sun
- swa
- swe
- tam
- taq
- tel
- tgk
- tha
- tur
- twi
- ukr
- urd
- uzb
- vie
- wol
- xho
- yid
- yor
- zho
- zul
数据集特征
每个语言配置的数据集包含以下特征:
- id: 数据类型为 int64
- inputs: 数据类型为 string
- targets: 数据类型为 string
- dataset_name: 数据类型为 string
- sub_dataset_name: 数据类型为 string
- task_type: 数据类型为 string
- template_id: 数据类型为 int64
- language: 数据类型为 string
- script: 数据类型为 string
- split: 数据类型为 string
数据集拆分
数据集根据不同的语言配置,拆分为训练集、验证集和测试集,具体信息如下:
训练集
- 每个语言配置的训练集包含不同数量的字节和示例数。
- 示例:achinese 训练集包含 4777872484 字节和 7145730 示例。
验证集
- 每个语言配置的验证集包含不同数量的字节和示例数。
- 示例:achinese 验证集包含 399703157 字节和 545944 示例。
测试集
- 每个语言配置的测试集包含不同数量的字节和示例数。
- 示例:achinese 测试集包含 438143574 字节和 550610 示例。
数据集大小
每个语言配置的数据集大小包括下载大小和数据集总大小,具体信息如下:
- 下载大小:从数据集下载时的大小。
- 数据集总大小:包括所有拆分后的数据集总大小。
- 示例:achinese 数据集的下载大小为 2233825990,数据集总大小为 5615719215。
许可证
本数据集遵循 Apache-2.0 许可证。
搜集汇总
数据集介绍

构建方式
在自然语言处理领域,多语言数据集的构建对于推动语言模型的泛化能力至关重要。CohereForAI/aya_collection_language_split数据集通过整合来自多样化来源的指令遵循任务,采用语言分割策略进行系统化构建。该数据集从多个公开数据集中提取样本,并依据语言标识符进行重组,确保每种语言配置独立存在。构建过程中,每个样本均标注了任务类型、模板标识符及脚本信息,并通过训练集、验证集和测试集的划分,为模型评估提供了结构化支持。
特点
该数据集以其广泛的语言覆盖和任务多样性而著称,涵盖了从阿塞拜疆语到祖鲁语等超过一百种语言,包括多种低资源语言。每个语言配置均包含输入与目标字段,并附带详细的元数据,如数据集名称、子数据集名称和任务类型,便于进行细粒度的语言分析。数据规模庞大,例如英语配置包含超过1400万训练样本,而低资源语言如本巴语则样本较少,体现了资源分配的平衡性。这种设计支持跨语言迁移学习和多语言模型性能的全面评估。
使用方法
使用该数据集时,研究人员可通过HuggingFace库直接加载特定语言配置,例如'achinese'或'english',以访问对应的训练、验证和测试分割。数据集适用于指令微调、多语言文本生成及跨语言理解任务,用户可基于输入-目标对进行模型训练,并利用任务类型字段进行任务特定优化。在评估阶段,建议结合多种语言配置以测试模型的泛化能力,同时注意低资源语言的数据稀缺性可能影响性能,需采用适当的数据增强或迁移学习策略。
背景与挑战
背景概述
在自然语言处理领域,多语言模型的训练长期受限于高质量、多样化语料库的稀缺性,尤其是对于资源匮乏的语言。CohereForAI/aya_collection_language_split数据集由Cohere for AI团队于2024年构建,旨在应对这一挑战。该数据集覆盖了超过100种语言,包括阿塞拜疆语、亚美尼亚语、班贾尔语等众多低资源语言,核心研究问题聚焦于如何通过大规模、多任务的指令微调数据,提升模型在跨语言理解和生成任务上的泛化能力。其广泛的语言覆盖为多语言大语言模型的研究提供了关键的数据基础,推动了语言技术在全球范围内的包容性发展。
当前挑战
该数据集致力于解决多语言指令遵循与文本生成任务的挑战,其核心在于如何使模型在多样语言和文化背景下准确理解并执行复杂指令。构建过程中的挑战尤为显著:首先,数据收集需平衡语言代表性,许多低资源语言如邦板牙语或巴厘语可用数字文本极少,导致语料规模和质量不均;其次,数据标注需要跨语言的一致性,不同语言的指令模板需保持任务语义的对等性,这依赖于专业语言学的深入参与;此外,数据清洗与预处理需处理脚本变体、方言差异及噪声问题,确保多语言语料的纯净度与可用性。
常用场景
经典使用场景
在自然语言处理领域,多语言指令微调已成为提升模型跨语言泛化能力的关键范式。CohereForAI/aya_collection_language_split数据集以其涵盖101种语言的庞大指令对集合,为研究者提供了构建多语言对话与理解模型的经典训练资源。该数据集通过统一的输入-输出格式,支持模型学习从翻译、摘要到问答等多种任务在不同语言中的表现形式,尤其适用于训练具备广泛语言适应性的指令跟随模型。
解决学术问题
长期以来,自然语言处理研究受限于高质量多语言数据的稀缺,尤其是低资源语言的指令对齐数据匮乏。该数据集系统性地解决了多语言指令微调中数据不平衡与覆盖不足的核心难题,为探索模型在非拉丁语系及低资源语言上的性能提供了基准。其意义在于推动了语言技术民主化进程,使得学术研究能够更公平地评估和提升模型在全球语言社群中的实用性与包容性。
衍生相关工作
该数据集的发布催生了多语言大模型研究的一系列创新探索。基于其构建的Aya模型系列展示了在百种语言指令任务上的强大性能,后续研究进一步拓展至多语言思维链推理、文化适应性对话生成等方向。相关工作还包括利用该数据集进行语言表征对比分析,揭示跨语言迁移中的语法结构泛化规律,以及开发针对特定语言族(如非洲语言、南亚语言)的专项优化框架,持续丰富多语言人工智能的技术生态。
以上内容由遇见数据集搜集并总结生成



