janavivekariya/aya_collection
Source: Hugging Face. Updated 2025-12-05; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/janavivekariya/aya_collection
Resource description:
---
language:
- ace
- afr
- amh
- ara
- aze
- ban
- bbc
- bel
- bem
- ben
- bjn
- bul
- cat
- ceb
- ces
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fil
- fin
- fon
- fra
- gla
- gle
- glg
- guj
- hat
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kan
- kas
- kat
- kau
- kaz
- khm
- kin
- kir
- kor
- kur
- lao
- lav
- lij
- lit
- ltz
- mad
- mal
- man
- mar
- min
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nij
- nld
- nor
- nso
- nya
- pan
- pes
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- srp
- sun
- swa
- swe
- tam
- taq
- tel
- tgk
- tha
- tur
- twi
- ukr
- urd
- uzb
- vie
- wol
- xho
- yid
- yor
- zho
- zul
license: apache-2.0
size_categories:
- 100M<n<1B
task_categories:
- text-classification
- summarization
- translation
pretty_name: Aya Collection
dataset_info:
- config_name: aya_dataset
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 245523658
num_examples: 202364
download_size: 134230030
dataset_size: 245523658
- config_name: templated_afriqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 1053208.8833372337
num_examples: 6834
- name: train
num_bytes: 785976.7786098759
num_examples: 5100
- name: validation
num_bytes: 794915.3380528903
num_examples: 5158
download_size: 945238
dataset_size: 2634101.0
- config_name: templated_afrisenti
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 13970874.910620399
num_examples: 42576
- name: train
num_bytes: 32313882.88468279
num_examples: 98476
- name: validation
num_bytes: 6141462.204696811
num_examples: 18716
download_size: 13309887
dataset_size: 52426220.0
- config_name: templated_amharic_qa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 1563941.8685517767
num_examples: 523
- name: train
num_bytes: 5475291.704241497
num_examples: 1831
- name: validation
num_bytes: 786456.4272067252
num_examples: 263
download_size: 3648433
dataset_size: 7825689.999999999
- config_name: templated_armenian_instruct
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 1864796.3648305084
num_examples: 3063
- name: train
num_bytes: 2445604.6351694916
num_examples: 4017
download_size: 1825641
dataset_size: 4310401.0
- config_name: templated_bengali_news
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 14242457
num_examples: 19096
download_size: 4609132
dataset_size: 14242457
- config_name: templated_dutch_imdb
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 39967063.5
num_examples: 24992
- name: train
num_bytes: 39967063.5
num_examples: 24992
download_size: 44533807
dataset_size: 79934127.0
- config_name: templated_hindi_headline
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 228788501.12729776
num_examples: 23452
- name: train
num_bytes: 919144047.8727022
num_examples: 94217
download_size: 243324488
dataset_size: 1147932549.0
- config_name: templated_hindi_news
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 109524809.11948325
num_examples: 10655
- name: train
num_bytes: 437112433.88051677
num_examples: 42524
download_size: 112865381
dataset_size: 546637243.0
- config_name: templated_indic_paraphrase
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 5340504
num_examples: 7523
download_size: 1724626
dataset_size: 5340504
- config_name: templated_indic_sentiment
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 7496187
num_examples: 11559
download_size: 3003109
dataset_size: 7496187
- config_name: templated_indo_stories
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 2042351
num_examples: 2599
download_size: 813713
dataset_size: 2042351
- config_name: templated_japanese_instruct
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 1345341895
num_examples: 2463624
download_size: 580330810
dataset_size: 1345341895
- config_name: templated_joke_explaination
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 591008
num_examples: 754
download_size: 157851
dataset_size: 591008
- config_name: templated_ligurian_news
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: validation
num_bytes: 105221.25
num_examples: 54
- name: test
num_bytes: 140295.0
num_examples: 72
- name: train
num_bytes: 596253.75
num_examples: 306
download_size: 546344
dataset_size: 841770.0
- config_name: templated_masakhanews
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 31426840.99009901
num_examples: 9240
- name: train
num_bytes: 109538186.24752475
num_examples: 32206
- name: validation
num_bytes: 15679408.762376238
num_examples: 4610
download_size: 86433056
dataset_size: 156644436.0
- config_name: templated_mintaka
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 41153051.4
num_examples: 156000
- name: train
num_bytes: 144035679.9
num_examples: 546000
- name: validation
num_bytes: 20576525.7
num_examples: 78000
download_size: 43108344
dataset_size: 205765257.0
- config_name: templated_ntx_llm
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 10019994
num_examples: 5983
download_size: 1037270
dataset_size: 10019994
- config_name: templated_nusax_senti
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 2684840.4
num_examples: 8000
- name: train
num_bytes: 3356050.5
num_examples: 10000
- name: validation
num_bytes: 671210.1
num_examples: 2000
download_size: 2336444
dataset_size: 6712101.0
- config_name: templated_persian_farstail
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 731412.1801486664
num_examples: 1029
- name: train
num_bytes: 3424629.62483603
num_examples: 4818
- name: validation
num_bytes: 720750.1950153039
num_examples: 1014
download_size: 1417008
dataset_size: 4876792.0
- config_name: templated_persian_instruct
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 38518994.420354694
num_examples: 11186
- name: train
num_bytes: 564885564.1599021
num_examples: 164044
- name: validation
num_bytes: 38512107.41974315
num_examples: 11184
download_size: 280563392
dataset_size: 641916666.0
- config_name: templated_scirepeval
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: validation
num_bytes: 53956804
num_examples: 32973
download_size: 27742964
dataset_size: 53956804
- config_name: templated_seed_instruct
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: validation
num_bytes: 186542.23316647828
num_examples: 380
- name: test
num_bytes: 197342.04666559017
num_examples: 402
- name: train
num_bytes: 5696410.720167931
num_examples: 11604
download_size: 2674875
dataset_size: 6080295.0
- config_name: templated_soda
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 487742788.92976975
num_examples: 595872
- name: train
num_bytes: 2519225981.566041
num_examples: 3077721
- name: validation
num_bytes: 479157981.5041894
num_examples: 585384
download_size: 1668121549
dataset_size: 3486126752.0
- config_name: templated_tamil_stories
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 14555943
num_examples: 1202
download_size: 4912529
dataset_size: 14555943
- config_name: templated_tamil_thirukkural
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 7722387
num_examples: 3990
download_size: 1441119
dataset_size: 7722387
- config_name: templated_telugu_food
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 1108509
num_examples: 441
download_size: 312391
dataset_size: 1108509
- config_name: templated_telugu_jokes
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 966698
num_examples: 929
download_size: 298210
dataset_size: 966698
- config_name: templated_telugu_news
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 1150840295
num_examples: 467090
download_size: 423260269
dataset_size: 1150840295
- config_name: templated_telugu_poems
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 8244805
num_examples: 5115
download_size: 2713433
dataset_size: 8244805
- config_name: templated_telugu_riddles
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 339040
num_examples: 844
download_size: 79031
dataset_size: 339040
- config_name: templated_thai_pos
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 319580.309461865
num_examples: 1000
- name: train
num_bytes: 41690529.69053814
num_examples: 130454
download_size: 7405764
dataset_size: 42010110.0
- config_name: templated_thai_scb
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 131923007.25034823
num_examples: 177862
- name: train
num_bytes: 1188824615.223528
num_examples: 1602804
- name: validation
num_bytes: 131917073.5261238
num_examples: 177854
download_size: 441007386
dataset_size: 1452664696.0
- config_name: templated_thai_usembassy
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 10002322
num_examples: 1230
download_size: 3958145
dataset_size: 10002322
- config_name: templated_thai_wikitionary
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 12238652
num_examples: 19729
download_size: 2641369
dataset_size: 12238652
- config_name: templated_turku_paraphrase
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 9449925.655740838
num_examples: 31413
- name: train
num_bytes: 75488399.52960008
num_examples: 250935
- name: validation
num_bytes: 9502269.814659085
num_examples: 31587
download_size: 28908781
dataset_size: 94440595.00000001
- config_name: templated_ukranian_gec
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 21369624
num_examples: 29958
download_size: 9511988
dataset_size: 21369624
- config_name: templated_uner_llm
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 59421032.72376601
num_examples: 54957
- name: test
num_bytes: 16164354.663105734
num_examples: 14950
- name: validation
num_bytes: 8420601.613128258
num_examples: 7788
download_size: 12453483
dataset_size: 84005989.0
- config_name: templated_urdu_news_category
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 29923228.33936761
num_examples: 11187
- name: train
num_bytes: 269284981.6606324
num_examples: 100674
download_size: 118185925
dataset_size: 299208210.0
- config_name: templated_urdu_news_gen
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 29497844.81704079
num_examples: 11187
- name: train
num_bytes: 265456872.1829592
num_examples: 100674
download_size: 123276747
dataset_size: 294954717.0
- config_name: templated_urdu_news_headline
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 29258423.35545901
num_examples: 11187
- name: train
num_bytes: 263302271.644541
num_examples: 100674
download_size: 123095949
dataset_size: 292560695.0
- config_name: templated_wiki_split
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 4608986.773259303
num_examples: 10000
- name: train
num_bytes: 912527760.4534814
num_examples: 1979888
- name: validation
num_bytes: 4608986.773259303
num_examples: 10000
download_size: 395631256
dataset_size: 921745734.0
- config_name: templated_xcsqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: validation
num_bytes: 6315047.0
num_examples: 17000
download_size: 2125506
dataset_size: 6315047.0
- config_name: templated_xlel_wd
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 493033268.5027245
num_examples: 621319
- name: train
num_bytes: 3671177872.612755
num_examples: 4626407
- name: validation
num_bytes: 420416838.88452065
num_examples: 529808
download_size: 2363004380
dataset_size: 4584627980.0
- config_name: templated_xwikis
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: test
num_bytes: 219985468.96557257
num_examples: 34987
- name: train
num_bytes: 8995693557.81201
num_examples: 1430696
- name: validation
num_bytes: 251360765.22241676
num_examples: 39977
download_size: 5713306872
dataset_size: 9467039791.999998
- config_name: translated_adversarial_qa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 167379954.08333334
num_examples: 119000
- name: train
num_bytes: 1673799540.8333333
num_examples: 1190000
- name: validation
num_bytes: 167379954.08333334
num_examples: 119000
download_size: 595462085
dataset_size: 2008559448.9999998
- config_name: translated_cnn_dailymail
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 4825107898.98773
num_examples: 1378800
- name: train
num_bytes: 41993976492.495476
num_examples: 12000000
- name: validation
num_bytes: 5613754777.516795
num_examples: 1604160
download_size: 25383694727
dataset_size: 52432839169.0
- config_name: translated_dolly
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: split
dtype: string
- name: script
dtype: string
splits:
- name: train
num_bytes: 2188278931
num_examples: 1762152
download_size: 1089137630
dataset_size: 2188278931
- config_name: translated_flan_coqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 2884413536
num_examples: 762671
download_size: 1416350365
dataset_size: 2884413536
- config_name: translated_flan_cot
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 7470682150.0
num_examples: 11029200
download_size: 3086804878
dataset_size: 7470682150.0
- config_name: translated_flan_gem_wiki
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 11446176046
num_examples: 3230493
download_size: 5342129672
dataset_size: 11446176046
- config_name: translated_flan_lambada
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 223527122
num_examples: 509201
download_size: 99315916
dataset_size: 223527122
- config_name: translated_flan_qa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 34188800
num_examples: 64260
download_size: 14245088
dataset_size: 34188800
- config_name: translated_hotpotqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 13234982265.87797
num_examples: 42301644
- name: validation
num_bytes: 833990488.1220294
num_examples: 2665600
download_size: 4862020346
dataset_size: 14068972754.0
- config_name: translated_joke_explaination
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 96548938
num_examples: 89726
download_size: 40366737
dataset_size: 96548938
- config_name: translated_mintaka
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 131276187.4
num_examples: 476000
- name: train
num_bytes: 459466655.9
num_examples: 1666000
- name: validation
num_bytes: 65638093.7
num_examples: 238000
download_size: 130340546
dataset_size: 656380937.0
- config_name: translated_mlqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 3730486242.0756793
num_examples: 2746830
- name: validation
num_bytes: 369508041.92432094
num_examples: 272076
download_size: 1662296336
dataset_size: 4099994284.0
- config_name: translated_nqopen
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 4456165405.095046
num_examples: 20926150
- name: validation
num_bytes: 182959989.9049544
num_examples: 859180
download_size: 1482593128
dataset_size: 4639125395.0
- config_name: translated_paws
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 536748719.07157385
num_examples: 952000
- name: train
num_bytes: 3314490433.8568525
num_examples: 5878719
- name: validation
num_bytes: 536748719.07157385
num_examples: 952000
download_size: 686023556
dataset_size: 4387987872.0
- config_name: translated_piqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1324751595.2891204
num_examples: 1917447
- name: validation
num_bytes: 151113599.71087962
num_examples: 218722
download_size: 504206733
dataset_size: 1475865195.0
- config_name: translated_soda
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 9332736341.158312
num_examples: 17876160
- name: validation
num_bytes: 9168469957.193184
num_examples: 17561520
- name: train
num_bytes: 74651741547.6485
num_examples: 142989840
download_size: 32022718450
dataset_size: 93152947846.0
- config_name: translated_wiki_split
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 72471632064.9965
num_examples: 117803336
- name: validation
num_bytes: 366039049.0017441
num_examples: 595000
- name: test
num_bytes: 366039049.0017441
num_examples: 595000
download_size: 27980267627
dataset_size: 73203710163.0
- config_name: translated_wikiqa
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 15512870.67820774
num_examples: 34867
- name: train
num_bytes: 55062749.16496945
num_examples: 123760
- name: validation
num_bytes: 7412293.156822811
num_examples: 16660
download_size: 32773189
dataset_size: 77987913.00000001
- config_name: translated_xlel_wd
features:
- name: id
dtype: int64
- name: inputs
dtype: string
- name: targets
dtype: string
- name: dataset_name
dtype: string
- name: sub_dataset_name
dtype: string
- name: task_type
dtype: string
- name: template_id
dtype: int64
- name: language
dtype: string
- name: script
dtype: string
- name: split
dtype: string
splits:
- name: test
num_bytes: 8449087876.213723
num_examples: 8755108
- name: validation
num_bytes: 7326325551.677284
num_examples: 7591680
- name: train
num_bytes: 60579299633.10899
num_examples: 62773440
download_size: 35927637128
dataset_size: 76354713061.0
configs:
- config_name: aya_dataset
data_files:
- split: train
path: aya_dataset/train-*
- config_name: templated_afriqa
data_files:
- split: test
path: templated_afriqa/test-*
- split: train
path: templated_afriqa/train-*
- split: validation
path: templated_afriqa/validation-*
- config_name: templated_afrisenti
data_files:
- split: test
path: templated_afrisenti/test-*
- split: train
path: templated_afrisenti/train-*
- split: validation
path: templated_afrisenti/validation-*
- config_name: templated_amharic_qa
data_files:
- split: test
path: templated_amharic_qa/test-*
- split: train
path: templated_amharic_qa/train-*
- split: validation
path: templated_amharic_qa/validation-*
- config_name: templated_armenian_instruct
data_files:
- split: test
path: templated_armenian_instruct/test-*
- split: train
path: templated_armenian_instruct/train-*
- config_name: templated_bengali_news
data_files:
- split: train
path: templated_bengali_news/train-*
- config_name: templated_dutch_imdb
data_files:
- split: test
path: templated_dutch_imdb/test-*
- split: train
path: templated_dutch_imdb/train-*
- config_name: templated_hindi_headline
data_files:
- split: test
path: templated_hindi_headline/test-*
- split: train
path: templated_hindi_headline/train-*
- config_name: templated_hindi_news
data_files:
- split: test
path: templated_hindi_news/test-*
- split: train
path: templated_hindi_news/train-*
- config_name: templated_indic_paraphrase
data_files:
- split: train
path: templated_indic_paraphrase/train-*
- config_name: templated_indic_sentiment
data_files:
- split: train
path: templated_indic_sentiment/train-*
- config_name: templated_indo_stories
data_files:
- split: train
path: templated_indo_stories/train-*
- config_name: templated_japanese_instruct
data_files:
- split: train
path: templated_japanese_instruct/train-*
- config_name: templated_joke_explaination
data_files:
- split: train
path: templated_joke_explaination/train-*
- config_name: templated_ligurian_news
data_files:
- split: validation
path: templated_ligurian_news/validation-*
- split: test
path: templated_ligurian_news/test-*
- split: train
path: templated_ligurian_news/train-*
- config_name: templated_masakhanews
data_files:
- split: test
path: templated_masakhanews/test-*
- split: train
path: templated_masakhanews/train-*
- split: validation
path: templated_masakhanews/validation-*
- config_name: templated_mintaka
data_files:
- split: test
path: templated_mintaka/test-*
- split: train
path: templated_mintaka/train-*
- split: validation
path: templated_mintaka/validation-*
- config_name: templated_ntx_llm
data_files:
- split: train
path: templated_ntx_llm/train-*
- config_name: templated_nusax_senti
data_files:
- split: test
path: templated_nusax_senti/test-*
- split: train
path: templated_nusax_senti/train-*
- split: validation
path: templated_nusax_senti/validation-*
- config_name: templated_persian_farstail
data_files:
- split: test
path: templated_persian_farstail/test-*
- split: train
path: templated_persian_farstail/train-*
- split: validation
path: templated_persian_farstail/validation-*
- config_name: templated_persian_instruct
data_files:
- split: test
path: templated_persian_instruct/test-*
- split: train
path: templated_persian_instruct/train-*
- split: validation
path: templated_persian_instruct/validation-*
- config_name: templated_scirepeval
data_files:
- split: validation
path: templated_scirepeval/validation-*
- config_name: templated_seed_instruct
data_files:
- split: validation
path: templated_seed_instruct/validation-*
- split: test
path: templated_seed_instruct/test-*
- split: train
path: templated_seed_instruct/train-*
- config_name: templated_soda
data_files:
- split: test
path: templated_soda/test-*
- split: train
path: templated_soda/train-*
- split: validation
path: templated_soda/validation-*
- config_name: templated_tamil_stories
data_files:
- split: train
path: templated_tamil_stories/train-*
- config_name: templated_tamil_thirukkural
data_files:
- split: train
path: templated_tamil_thirukkural/train-*
- config_name: templated_telugu_food
data_files:
- split: train
path: templated_telugu_food/train-*
- config_name: templated_telugu_jokes
data_files:
- split: train
path: templated_telugu_jokes/train-*
- config_name: templated_telugu_news
data_files:
- split: train
path: templated_telugu_news/train-*
- config_name: templated_telugu_poems
data_files:
- split: train
path: templated_telugu_poems/train-*
- config_name: templated_telugu_riddles
data_files:
- split: train
path: templated_telugu_riddles/train-*
- config_name: templated_thai_pos
data_files:
- split: test
path: templated_thai_pos/test-*
- split: train
path: templated_thai_pos/train-*
- config_name: templated_thai_scb
data_files:
- split: test
path: templated_thai_scb/test-*
- split: train
path: templated_thai_scb/train-*
- split: validation
path: templated_thai_scb/validation-*
- config_name: templated_thai_usembassy
data_files:
- split: train
path: templated_thai_usembassy/train-*
- config_name: templated_thai_wikitionary
data_files:
- split: train
path: templated_thai_wikitionary/train-*
- config_name: templated_turku_paraphrase
data_files:
- split: test
path: templated_turku_paraphrase/test-*
- split: train
path: templated_turku_paraphrase/train-*
- split: validation
path: templated_turku_paraphrase/validation-*
- config_name: templated_ukranian_gec
data_files:
- split: train
path: templated_ukranian_gec/train-*
- config_name: templated_uner_llm
data_files:
- split: train
path: templated_uner_llm/train-*
- split: test
path: templated_uner_llm/test-*
- split: validation
path: templated_uner_llm/validation-*
- config_name: templated_urdu_news_category
data_files:
- split: test
path: templated_urdu_news_category/test-*
- split: train
path: templated_urdu_news_category/train-*
- config_name: templated_urdu_news_gen
data_files:
- split: test
path: templated_urdu_news_gen/test-*
- split: train
path: templated_urdu_news_gen/train-*
- config_name: templated_urdu_news_headline
data_files:
- split: test
path: templated_urdu_news_headline/test-*
- split: train
path: templated_urdu_news_headline/train-*
- config_name: templated_wiki_split
data_files:
- split: test
path: templated_wiki_split/test-*
- split: train
path: templated_wiki_split/train-*
- split: validation
path: templated_wiki_split/validation-*
- config_name: templated_xcsqa
data_files:
- split: validation
path: templated_xcsqa/validation-*
- config_name: templated_xlel_wd
data_files:
- split: test
path: templated_xlel_wd/test-*
- split: train
path: templated_xlel_wd/train-*
- split: validation
path: templated_xlel_wd/validation-*
- config_name: templated_xwikis
data_files:
- split: test
path: templated_xwikis/test-*
- split: train
path: templated_xwikis/train-*
- split: validation
path: templated_xwikis/validation-*
- config_name: translated_adversarial_qa
data_files:
- split: test
path: translated_adversarial_qa/test-*
- split: train
path: translated_adversarial_qa/train-*
- split: validation
path: translated_adversarial_qa/validation-*
- config_name: translated_cnn_dailymail
data_files:
- split: test
path: translated_cnn_dailymail/test-*
- split: train
path: translated_cnn_dailymail/train-*
- split: validation
path: translated_cnn_dailymail/validation-*
- config_name: translated_dolly
data_files:
- split: train
path: translated_dolly/train-*
- config_name: translated_flan_coqa
data_files:
- split: train
path: translated_flan_coqa/train-*
- config_name: translated_flan_cot
data_files:
- split: train
path: translated_flan_cot/train-*
- config_name: translated_flan_gem_wiki
data_files:
- split: train
path: translated_flan_gem_wiki/train-*
- config_name: translated_flan_lambada
data_files:
- split: train
path: translated_flan_lambada/train-*
- config_name: translated_flan_qa
data_files:
- split: train
path: translated_flan_qa/train-*
- config_name: translated_hotpotqa
data_files:
- split: train
path: translated_hotpotqa/train-*
- split: validation
path: translated_hotpotqa/validation-*
- config_name: translated_joke_explaination
data_files:
- split: train
path: translated_joke_explaination/train-*
- config_name: translated_mintaka
data_files:
- split: test
path: translated_mintaka/test-*
- split: train
path: translated_mintaka/train-*
- split: validation
path: translated_mintaka/validation-*
- config_name: translated_mlqa
data_files:
- split: test
path: translated_mlqa/test-*
- split: validation
path: translated_mlqa/validation-*
- config_name: translated_nqopen
data_files:
- split: train
path: translated_nqopen/train-*
- split: validation
path: translated_nqopen/validation-*
- config_name: translated_paws
data_files:
- split: test
path: translated_paws/test-*
- split: train
path: translated_paws/train-*
- split: validation
path: translated_paws/validation-*
- config_name: translated_piqa
data_files:
- split: train
path: translated_piqa/train-*
- split: validation
path: translated_piqa/validation-*
- config_name: translated_soda
data_files:
- split: test
path: translated_soda/test-*
- split: validation
path: translated_soda/validation-*
- split: train
path: translated_soda/train-*
- config_name: translated_wiki_split
data_files:
- split: test
path: translated_wiki_split/test-*
- split: train
path: translated_wiki_split/train-*
- split: validation
path: translated_wiki_split/validation-*
- config_name: translated_wikiqa
data_files:
- split: test
path: translated_wikiqa/test-*
- split: train
path: translated_wikiqa/train-*
- split: validation
path: translated_wikiqa/validation-*
- config_name: translated_xlel_wd
data_files:
- split: test
path: translated_xlel_wd/test-*
- split: validation
path: translated_xlel_wd/validation-*
- split: train
path: translated_xlel_wd/train-*
---

****This dataset is uploaded in two places: here and additionally [here](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) as 'Aya Collection Language Split.' The two datasets are identical in content but differ in how the upload is structured. This dataset is organized into folders by dataset name, whereas the [language split version](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) organizes the Aya Collection into folders by language. We recommend the language split version if you only want to download data for a single language or a small set of languages, and this version if you want to download data by source or the entire collection.****
# Dataset Summary
The Aya Collection is a massive multilingual collection consisting of 513 million instances of prompts and completions covering a wide range of tasks.
This collection incorporates instruction-style templates from fluent speakers and applies them to a curated list of datasets, as well as translations of instruction-style datasets into 101 languages. Aya Dataset, a human-curated multilingual instruction and response dataset, is also part of this collection. See our paper for more details regarding the collection.
- **Curated by:** Contributors of [Aya Open Science Initiative](https://cohere.com/research/aya)
- **Language(s):** 115 languages
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)
- **Aya Datasets Family:**
| Name | Explanation |
|------|--------------|
| [aya_dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset) | Human-annotated multilingual instruction finetuning dataset, comprising over 204K instances across 65 languages. |
| [aya_collection](https://huggingface.co/datasets/CohereLabs/aya_collection) | Created by applying instruction-style templates from fluent speakers to 44 datasets, including translations of 19 instruction-style datasets into 101 languages. This collection is structured by dataset-level subsets. An alternative version of the collection structured by language subsets is also available.|
| [aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) | Aya Collection structured based on language level subsets. |
| [aya_evaluation_suite](https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite) | A diverse evaluation set for multilingual open-ended generation, featuring 250 culturally grounded prompts in 7 languages, 200 translated prompts in 24 languages, and human-edited versions selected for cross-cultural relevance from English Dolly in 6 languages.|
| [aya_redteaming](https://huggingface.co/datasets/CohereLabs/aya_redteaming)| A red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm with explicit labels for "global" and "local" harm.|
# Dataset
The `Aya Collection` is a comprehensive, large corpus of datasets that researchers around the world can use to train multilingual models. Our goal is to include only datasets whose licenses permit manipulation and redistribution.
The `Aya Collection` consists of three different sources of data:
1. Templated data: We collaborated with fluent speakers to create templates that allowed for the automatic expansion of existing datasets into various languages.
2. Translated data: We translated a hand-selected subset of 19 datasets into 101 languages (114 dialects) using the NLLB 3.3B parameter machine translation model.
3. Aya Dataset: We release the [Aya Dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset) as a subset of the overall collection. This is the only dataset in the collection that is human-annotated in its entirety.
## Load with Datasets
To load this dataset with Datasets, install the library with `pip install datasets --upgrade` and then use the following code:
```python
from datasets import load_dataset
dataset = load_dataset("CohereLabs/aya_collection", "templated_mintaka")
```
In the above code snippet, "templated_mintaka" refers to a subset of the aya_collection. You can load other subsets by specifying their names when loading the dataset.
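As a minimal sketch of working with several subsets, the snippet below groups subset names by source and streams a single split instead of downloading it in full. The helper names `group_by_source` and `stream_subset` are hypothetical and not part of any official tooling. The grouping assumes the `templated_` / `translated_` prefix convention visible in this card's config list; the `split` and `streaming` arguments are standard `load_dataset` parameters.

```python
from collections import defaultdict


def group_by_source(config_names):
    """Group subset names by prefix (hypothetical helper).

    Assumes the naming convention seen in this card's config list:
    `templated_*`, `translated_*`, and the standalone `aya_dataset`.
    """
    groups = defaultdict(list)
    for name in config_names:
        prefix = name.split("_", 1)[0]
        key = prefix if prefix in ("templated", "translated") else "other"
        groups[key].append(name)
    return dict(groups)


def stream_subset(name, split="train", repo="CohereLabs/aya_collection"):
    """Return an iterable over one split without downloading it in full."""
    # Imported lazily so group_by_source works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo, name, split=split, streaming=True)


# A few subset names taken from this card's config list.
example_names = [
    "aya_dataset",
    "templated_mintaka",
    "translated_mintaka",
    "translated_mlqa",
]
print(group_by_source(example_names))
```

Calling `next(iter(stream_subset("templated_mintaka")))` would then fetch a single record over the network. Note that not every subset has a `train` split: in the config list, `translated_mlqa` provides only `test` and `validation`, and `templated_xcsqa` only `validation`, so pass `split` explicitly for those.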
## Data Instances
An example of a `train` instance looks as follows:
```json
{"id": 246001,
"inputs": "The following query in English is taken from the geography category. What could be the answer to the question?\nWhat is the seventh tallest mountain in North America?",
"targets": "The answer is Mount Lucania.",
"dataset_name": "Mintaka-inst",
"sub_dataset_name": "-",
"task_type": "question-answering",
"template_id": 3,
"language": "eng",
"split": "train",
"script": "Latn"
}
```
## Data Fields
The data fields are the same across all subsets and splits:
- `id`: Unique id of the data point.
- `inputs`: Prompt or input to the language model.
- `targets`: Completion or output of the language model.
- `dataset_name`: The name of the source dataset that the data point was taken from.
- `sub_dataset_name`: If the source is a collection, this field indicates which part of that collection the data point was taken from. If it is not a collection, this field is left blank.
- `task_type`: The task type that this conversation belongs to.
- `template_id`: The id of the template applied to this data point.
- `language`: The ISO code of the dialect of the conversation.
- `script`: The script of the language.
- `split`: Indicates whether the data point is part of the `train`, `validation`, or `test` split.
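Because every subset shares this schema, records can be handled as plain dictionaries. The following is an illustrative sketch rather than official tooling: `AyaRecord` and `keep_record` are hypothetical names, and the sample record is the `train` instance shown in the Data Instances section.

```python
from typing import Optional, TypedDict


class AyaRecord(TypedDict):
    """Schema shared by all subsets, as listed in the data fields."""
    id: int
    inputs: str
    targets: str
    dataset_name: str
    sub_dataset_name: str
    task_type: str
    template_id: int
    language: str
    script: str
    split: str


def keep_record(record: AyaRecord, language: str, script: Optional[str] = None) -> bool:
    """Return True if the record matches the language code (and script, if given)."""
    if record["language"] != language:
        return False
    return script is None or record["script"] == script


sample: AyaRecord = {
    "id": 246001,
    "inputs": "The following query in English is taken from the geography category. "
              "What could be the answer to the question?\n"
              "What is the seventh tallest mountain in North America?",
    "targets": "The answer is Mount Lucania.",
    "dataset_name": "Mintaka-inst",
    "sub_dataset_name": "-",
    "task_type": "question-answering",
    "template_id": 3,
    "language": "eng",
    "split": "train",
    "script": "Latn",
}

print(keep_record(sample, "eng"))          # True
print(keep_record(sample, "eng", "Cyrl"))  # False
```

With a loaded subset, the same predicate could be passed to the Datasets `filter` method, for example `dataset.filter(lambda r: keep_record(r, "eng"))`.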
### Statistics
The total number of data points, including the `Aya Dataset`, is 513,758,189. To view the breakdown of dialect codes and the respective templated and translated data point counts in the Aya Collection, refer to the toggled table below.
<details>
<summary> <b> Breakdown of Aya Collection data point counts grouped by dialects </b> </summary>
|dialect code|language|translated data point count|templated data point count|total count |
|------------|--------|---------------------------|--------------------------|---------------|
|ace |Achinese|8240684 |2000 |8242684 |
|acm |Arabic |4120342 |0 |4120342 |
|acq |Arabic |4120342 |0 |4120342 |
|aeb |Arabic |4120342 |0 |4120342 |
|afr |Afrikaans|4120342 |6108 |4126450 |
|ajp |Arabic |4120342 |0 |4120342 |
|als |Albanian|4120342 |0 |4120342 |
|amh |Amharic |4120342 |25327 |4145669 |
|apc |Arabic |4120342 |0 |4120342 |
|arb |Arabic |6424999 |216430 |6641429 |
|ars |Arabic |4120342 |0 |4120342 |
|ary |Arabic |4120342 |18076 |4138418 |
|arz |Arabic |4120342 |0 |4120342 |
|azb |Azerbaijani|4120342 |0 |4120342 |
|azj |Azerbaijani|4120342 |0 |4120342 |
|bel |Belarusian|4120342 |21273 |4141615 |
|ben |Bengali |4120342 |30661 |4151003 |
|bjn |Banjar |8240684 |2000 |8242684 |
|bul |Bulgarian|4120342 |37722 |4158064 |
|cat |Catalan |4120342 |66900 |4187242 |
|ceb |Cebuano |4120342 |0 |4120342 |
|ces |Czech |4120342 |179604 |4299946 |
|ckb |Kurdish |4120342 |0 |4120342 |
|cym |Welsh |4120342 |0 |4120342 |
|dan |Danish |4120342 |36310 |4156652 |
|deu |German |4120342 |1326722 |5447064 |
|ell |Greek |4120342 |40291 |4160633 |
|eng |English |9771427 |8066678 |17838105 |
|epo |Esperanto|4120342 |0 |4120342 |
|est |Estonian|4120342 |0 |4120342 |
|eus |Basque |4120342 |0 |4120342 |
|fin |Finnish |4120342 |457895 |4578237 |
|fra |French |4120342 |835520 |4955862 |
|gla |Scottish Gaelic|4120342 |0 |4120342 |
|gle |Irish |4120342 |0 |4120342 |
|glg |Galician|4120342 |0 |4120342 |
|guj |Gujarati|4120342 |2157 |4122499 |
|hat |Haitian Creole|4120342 |0 |4120342 |
|hau |Hausa |4120342 |51396 |4171738 |
|heb |Hebrew |4120342 |103466 |4223808 |
|hin |Hindi |4120342 |260387 |4380729 |
|hun |Hungarian|4120342 |82039 |4202381 |
|hye |Armenian|4120342 |7080 |4127422 |
|ibo |Igbo |4120342 |36312 |4156654 |
|ind |Indonesian|4120342 |45709 |4166051 |
|isl |Icelandic|4120342 |0 |4120342 |
|ita |Italian |4120342 |405682 |4526024 |
|jav |Javanese|4120342 |829 |4121171 |
|jpn |Japanese|4120342 |2693177 |6813519 |
|kan |Kannada |4120342 |1156 |4121498 |
|kas |Kashmiri|4120342 |0 |4120342 |
|kat |Georgian|4120342 |0 |4120342 |
|kaz |Kazakh |4120342 |0 |4120342 |
|khk |Mongolian|4120342 |0 |4120342 |
|khm |Khmer |4120342 |0 |4120342 |
|kir |Kyrgyz |4120342 |0 |4120342 |
|kmr |Kurdish |4120342 |0 |4120342 |
|knc |Kanuri |8240684 |0 |8240684 |
|kor |Korean |4120342 |41011 |4161353 |
|lao |Lao |4120342 |0 |4120342 |
|lit |Lithuanian|4120342 |0 |4120342 |
|ltz |Luxembourgish|4120342 |0 |4120342 |
|lvs |Latvian |4120342 |0 |4120342 |
|mal |Malayalam|4120342 |4347 |4124689 |
|mar |Marathi |4120342 |3678 |4124020 |
|min |Minangkabau|6753788 |2000 |6755788 |
|mkd |Macedonian|4120342 |0 |4120342 |
|mlt |Maltese |4120342 |0 |4120342 |
|mni |Manipuri|4120342 |0 |4120342 |
|mri |Maori |4120342 |0 |4120342 |
|mya |Burmese |4120342 |0 |4120342 |
|nld |Dutch |4120342 |220181 |4340523 |
|nno |Norwegian|4120342 |0 |4120342 |
|nob |Norwegian|4120342 |0 |4120342 |
|npi |Nepali |4120342 |0 |4120342 |
|nso |Northern Sotho|4120342 |0 |4120342 |
|pbt |Pashto |4120342 |0 |4120342 |
|pes |Persian |4120342 |245520 |4365862 |
|plt |Malagasy|4120342 |0 |4120342 |
|pol |Polish |4120342 |332503 |4452845 |
|por |Portuguese|4120342 |287432 |4407774 |
|ron |Romanian|4120342 |36359 |4156701 |
|rus |Russian |4120342 |545920 |4666262 |
|sin |Sinhala |4120342 |195 |4120537 |
|slk |Slovak |4120342 |27845 |4148187 |
|slv |Slovenian|4120342 |25731 |4146073 |
|smo |Samoan |4120342 |0 |4120342 |
|sna |Shona |4120342 |3684 |4124026 |
|snd |Sindhi |4120342 |0 |4120342 |
|som |Somali |4120342 |2926 |4123268 |
|sot |Southern Sotho|4120342 |0 |4120342 |
|spa |Spanish |4120342 |379194 |4499536 |
|srp |Serbian |4120342 |77124 |4197466 |
|sun |Sundanese|4120342 |2208 |4122550 |
|swe |Swedish |4120342 |76486 |4196828 |
|swh |Swahili |4120342 |12726 |4133068 |
|tam |Tamil |4120342 |11462 |4131804 |
|taq |Tamasheq|4120342 |0 |4120342 |
|tel |Telugu |4120342 |477821 |4598163 |
|tgk |Tajik |4120342 |0 |4120342 |
|tha |Thai |4120342 |2125180 |6245522 |
|tur |Turkish |4120342 |59932 |4180274 |
|ukr |Ukrainian|4120342 |189384 |4309726 |
|urd |Urdu |4120342 |337739 |4458081 |
|uzn |Uzbek |4120342 |0 |4120342 |
|vie |Vietnamese|4120342 |42232 |4162574 |
|xho |Xhosa |4120342 |2952 |4123294 |
|ydd |Yiddish |4120342 |0 |4120342 |
|yor |Yoruba |4120342 |4907 |4125249 |
|yue |Chinese |4120342 |0 |4120342 |
|zho-Hans |Chinese |4120342 |54528 |4174870 |
|zho-Hant |Chinese |4120342 |0 |4120342 |
|zsm |Malay |4120342 |13950 |4134292 |
|zul |Zulu |4120342 |786 |4121128 |
|arq |Arabic |0 |6046 |6046 |
|ban |Balinese|0 |2000 |2000 |
|bbc |Toba Batak|0 |2000 |2000 |
|bem |Bemba |0 |776 |776 |
|fil |Filipino|0 |220 |220 |
|fon |Fon |0 |845 |845 |
|hrv |Croatian|0 |9007 |9007 |
|kin |Kinyarwanda|0 |11165 |11165 |
|lij |Ligurian|0 |6409 |6409 |
|mad |Madurese|0 |2000 |2000 |
|nij |Ngaju |0 |2000 |2000 |
|nor |Norwegian|0 |72352 |72352 |
|pan |Punjabi |0 |2156 |2156 |
|twi |Twi |0 |10840 |10840 |
|wol |Wolof |0 |785 |785 |
|zho |Chinese |0 |74972 |74972 |
PS: Templated data also includes Mozambican Portuguese, which doesn't have its own ISO language code.
</details>
<br>
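In the breakdown table, each row's total count is the sum of its translated and templated counts. A minimal sketch for checking this on rows copied from the table follows; the `parse_row` helper is hypothetical and assumes the five-column, pipe-delimited layout used there.

```python
def parse_row(row: str):
    """Split one pipe-delimited row into (code, language, translated, templated, total)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    code, language, translated, templated, total = cells
    return code, language, int(translated), int(templated), int(total)


# Rows copied verbatim from the breakdown table.
rows = [
    "|ace |Achinese|8240684 |2000 |8242684 |",
    "|deu |German |4120342 |1326722 |5447064 |",
    "|eng |English |9771427 |8066678 |17838105 |",
    "|zho |Chinese |0 |74972 |74972 |",
]

for row in rows:
    code, language, translated, templated, total = parse_row(row)
    # Each total should equal translated + templated.
    assert translated + templated == total, code
    print(f"{code}: {translated} + {templated} = {total}")
```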
# Motivations & Intentions
- **Curation Rationale:** Automatic augmentation of existing datasets serves to enhance the available linguistic resources for multiple languages. The list of languages was initially established from mT5 and aligned with the annotators’ language list and NLLB translation model. The datasets were translated directly from English for all languages.
# Additional Information
## Provenance
- **Methods Used:** A combination of crowd-sourced templating and automatic translation was employed to source this dataset.
- **Methodology Details:**
- *Source:* Existing NLP datasets
- *Dates of Collection:* May 2023 - Dec 2023
## Dataset Version and Maintenance
- **Maintenance Status:** Actively Maintained
- **Version Details:**
- *Current version:* 1.0
- *Last Update:* 02/2024
- *First Release:* 02/2024
## Authorship
- **Publishing Organization:** [Cohere Labs](https://cohere.com/research)
- **Industry Type:** Not-for-profit - Tech
- **Contact Details:** https://cohere.com/research/aya
## Licensing Information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.
## Citation Information
```bibtex
@misc{singh2024aya,
title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning},
author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker},
year={2024},
eprint={2402.06619},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
janavivekariya



