oscar-corpus/oscar

Hugging Face · Updated 2024-03-21 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/oscar-corpus/oscar
Resource description:
---
pretty_name: OSCAR
annotations_creators:
- no-annotation
language_creators:
- found
language:
- af
- als
- am
- an
- ar
- arz
- as
- ast
- av
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bh
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- ce
- ceb
- ckb
- cs
- cv
- cy
- da
- de
- diq
- dsb
- dv
- el
- eml
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- frr
- fy
- ga
- gd
- gl
- gn
- gom
- gu
- he
- hi
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ilo
- io
- is
- it
- ja
- jbo
- jv
- ka
- kk
- km
- kn
- ko
- krc
- ku
- kv
- kw
- ky
- la
- lb
- lez
- li
- lmo
- lo
- lrc
- lt
- lv
- mai
- mg
- mhr
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nah
- nap
- nds
- ne
- new
- nl
- nn
- 'no'
- oc
- or
- os
- pa
- pam
- pl
- pms
- pnb
- ps
- pt
- qu
- rm
- ro
- ru
- sa
- sah
- scn
- sd
- sh
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- tyv
- ug
- uk
- ur
- uz
- vec
- vi
- vo
- wa
- war
- wuu
- xal
- xmf
- yi
- yo
- yue
- zh
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
- 100M<n<1B
- 10K<n<100K
- 10M<n<100M
- 1K<n<10K
- 1M<n<10M
- n<1K
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: oscar
dataset_info:
- config_name: unshuffled_deduplicated_af
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 171320914
    num_examples: 130640
  download_size: 65989254
  dataset_size: 171320914
- config_name: unshuffled_deduplicated_als
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2915912
    num_examples: 4518
  download_size: 1263294
  dataset_size: 2915912
- config_name: unshuffled_deduplicated_arz
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 34893248
    num_examples: 79928
  download_size: 10027493
  dataset_size: 34893248
- config_name: unshuffled_deduplicated_an
  features:
  - name: id
dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 842246 num_examples: 2025 download_size: 133373 dataset_size: 842246 - config_name: unshuffled_deduplicated_ast features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2150022 num_examples: 5343 download_size: 856177 dataset_size: 2150022 - config_name: unshuffled_deduplicated_ba features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 93623739 num_examples: 27050 download_size: 25983491 dataset_size: 93623739 - config_name: unshuffled_deduplicated_am features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 215618603 num_examples: 43102 download_size: 61347279 dataset_size: 215618603 - config_name: unshuffled_deduplicated_as features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 73989818 num_examples: 9212 download_size: 15513004 dataset_size: 73989818 - config_name: unshuffled_deduplicated_azb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 20001183 num_examples: 9985 download_size: 5191704 dataset_size: 20001183 - config_name: unshuffled_deduplicated_be features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1077152244 num_examples: 307405 download_size: 306700943 dataset_size: 1077152244 - config_name: unshuffled_deduplicated_bo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 144506264 num_examples: 15762 download_size: 22365048 dataset_size: 144506264 - config_name: unshuffled_deduplicated_bxr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 11325 num_examples: 36 download_size: 3666 dataset_size: 11325 - config_name: unshuffled_deduplicated_ceb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 24439249 num_examples: 26145 
download_size: 7124786 dataset_size: 24439249 - config_name: unshuffled_deduplicated_az features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1526935070 num_examples: 626796 download_size: 521744076 dataset_size: 1526935070 - config_name: unshuffled_deduplicated_bcl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 900 num_examples: 1 download_size: 594 dataset_size: 900 - config_name: unshuffled_deduplicated_cy features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 140412555 num_examples: 98225 download_size: 53629697 dataset_size: 140412555 - config_name: unshuffled_deduplicated_dsb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 7589 num_examples: 37 download_size: 3640 dataset_size: 7589 - config_name: unshuffled_deduplicated_bn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 6233041155 num_examples: 1114481 download_size: 1257218381 dataset_size: 6233041155 - config_name: unshuffled_deduplicated_bs features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 125977 num_examples: 702 download_size: 38669 dataset_size: 125977 - config_name: unshuffled_deduplicated_ce features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 7021674 num_examples: 2984 download_size: 1862792 dataset_size: 7021674 - config_name: unshuffled_deduplicated_cv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 27359554 num_examples: 10130 download_size: 7461982 dataset_size: 27359554 - config_name: unshuffled_deduplicated_diq features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 161 num_examples: 1 download_size: 331 dataset_size: 161 - config_name: unshuffled_deduplicated_eml features: - name: id dtype: int64 - name: text 
dtype: string splits: - name: train num_bytes: 24657 num_examples: 80 download_size: 10055 dataset_size: 24657 - config_name: unshuffled_deduplicated_et features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2434152666 num_examples: 1172041 download_size: 966785545 dataset_size: 2434152666 - config_name: unshuffled_deduplicated_bg features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 14420684170 num_examples: 3398679 download_size: 3848659853 dataset_size: 14420684170 - config_name: unshuffled_deduplicated_bpy features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1725535 num_examples: 1770 download_size: 191472 dataset_size: 1725535 - config_name: unshuffled_deduplicated_ca features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 4544123629 num_examples: 2458067 download_size: 1734548117 dataset_size: 4544123629 - config_name: unshuffled_deduplicated_ckb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 237229156 num_examples: 68210 download_size: 60319928 dataset_size: 237229156 - config_name: unshuffled_deduplicated_ar features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 33468271639 num_examples: 9006977 download_size: 9667185012 dataset_size: 33468271639 - config_name: unshuffled_deduplicated_av features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 334755 num_examples: 360 download_size: 75341 dataset_size: 334755 - config_name: unshuffled_deduplicated_bar features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 551 num_examples: 4 download_size: 354 dataset_size: 551 - config_name: unshuffled_deduplicated_bh features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 35216 num_examples: 82 download_size: 6003 
dataset_size: 35216 - config_name: unshuffled_deduplicated_br features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 16712284 num_examples: 14724 download_size: 6468062 dataset_size: 16712284 - config_name: unshuffled_deduplicated_cbk features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 535 num_examples: 1 download_size: 247 dataset_size: 535 - config_name: unshuffled_deduplicated_da features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 10204168604 num_examples: 4771098 download_size: 3816376656 dataset_size: 10204168604 - config_name: unshuffled_deduplicated_dv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 82122241 num_examples: 17024 download_size: 16836170 dataset_size: 82122241 - config_name: unshuffled_deduplicated_eo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 239597935 num_examples: 84752 download_size: 92858714 dataset_size: 239597935 - config_name: unshuffled_deduplicated_fa features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 39986583410 num_examples: 8203495 download_size: 10459318520 dataset_size: 39986583410 - config_name: unshuffled_deduplicated_fy features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 26562554 num_examples: 20661 download_size: 10270434 dataset_size: 26562554 - config_name: unshuffled_deduplicated_gn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 24545 num_examples: 68 download_size: 9566 dataset_size: 24545 - config_name: unshuffled_deduplicated_cs features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 25590158564 num_examples: 12308039 download_size: 10494256383 dataset_size: 25590158564 - config_name: unshuffled_deduplicated_hi features: - name: id 
dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 9550345517 num_examples: 1909387 download_size: 2007441283 dataset_size: 9550345517 - config_name: unshuffled_deduplicated_hu features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 19027456462 num_examples: 6582908 download_size: 7368098962 dataset_size: 19027456462 - config_name: unshuffled_deduplicated_ie features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1688 num_examples: 11 download_size: 649 dataset_size: 1688 - config_name: unshuffled_deduplicated_fr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 147774253219 num_examples: 59448891 download_size: 55462770729 dataset_size: 147774253219 - config_name: unshuffled_deduplicated_gd features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1339050 num_examples: 3883 download_size: 420601 dataset_size: 1339050 - config_name: unshuffled_deduplicated_gu features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 758319353 num_examples: 169834 download_size: 162974870 dataset_size: 758319353 - config_name: unshuffled_deduplicated_hsb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1821734 num_examples: 3084 download_size: 728158 dataset_size: 1821734 - config_name: unshuffled_deduplicated_ia features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 373710 num_examples: 529 download_size: 52722 dataset_size: 373710 - config_name: unshuffled_deduplicated_io features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 139493 num_examples: 617 download_size: 42813 dataset_size: 139493 - config_name: unshuffled_deduplicated_jbo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 700428 num_examples: 617 
download_size: 203506 dataset_size: 700428 - config_name: unshuffled_deduplicated_km features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 609886370 num_examples: 108346 download_size: 114480044 dataset_size: 609886370 - config_name: unshuffled_deduplicated_ku features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 62855449 num_examples: 29054 download_size: 23343869 dataset_size: 62855449 - config_name: unshuffled_deduplicated_la features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 8867995 num_examples: 18808 download_size: 3421499 dataset_size: 8867995 - config_name: unshuffled_deduplicated_lmo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 458386 num_examples: 1374 download_size: 106048 dataset_size: 458386 - config_name: unshuffled_deduplicated_lv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1895693807 num_examples: 843195 download_size: 710448932 dataset_size: 1895693807 - config_name: unshuffled_deduplicated_min features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 318749 num_examples: 166 download_size: 10233 dataset_size: 318749 - config_name: unshuffled_deduplicated_mr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1487944837 num_examples: 212556 download_size: 299680349 dataset_size: 1487944837 - config_name: unshuffled_deduplicated_mwl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1121 num_examples: 7 download_size: 797 dataset_size: 1121 - config_name: unshuffled_deduplicated_nah features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 11540 num_examples: 58 download_size: 2868 dataset_size: 11540 - config_name: unshuffled_deduplicated_new features: - name: id dtype: int64 
- name: text dtype: string splits: - name: train num_bytes: 4226557 num_examples: 2126 download_size: 830767 dataset_size: 4226557 - config_name: unshuffled_deduplicated_oc features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 3938772 num_examples: 6485 download_size: 1338194 dataset_size: 3938772 - config_name: unshuffled_deduplicated_pam features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 319 num_examples: 1 download_size: 366 dataset_size: 319 - config_name: unshuffled_deduplicated_ps features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 254360032 num_examples: 67921 download_size: 71823163 dataset_size: 254360032 - config_name: unshuffled_deduplicated_it features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 73843292670 num_examples: 28522082 download_size: 27931571784 dataset_size: 73843292670 - config_name: unshuffled_deduplicated_ka features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1982841952 num_examples: 372158 download_size: 377220437 dataset_size: 1982841952 - config_name: unshuffled_deduplicated_ro features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 11601264185 num_examples: 5044757 download_size: 4478423935 dataset_size: 11601264185 - config_name: unshuffled_deduplicated_scn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2990 num_examples: 17 download_size: 1620 dataset_size: 2990 - config_name: unshuffled_deduplicated_ko features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 11956006533 num_examples: 3675420 download_size: 4462788278 dataset_size: 11956006533 - config_name: unshuffled_deduplicated_kw features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 14971 num_examples: 68 
download_size: 6195 dataset_size: 14971 - config_name: unshuffled_deduplicated_lez features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 3075326 num_examples: 1381 download_size: 763936 dataset_size: 3075326 - config_name: unshuffled_deduplicated_lrc features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 65291 num_examples: 72 download_size: 16272 dataset_size: 65291 - config_name: unshuffled_deduplicated_mg features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 13516085 num_examples: 13343 download_size: 4303472 dataset_size: 13516085 - config_name: unshuffled_deduplicated_ml features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2685637627 num_examples: 453904 download_size: 496801596 dataset_size: 2685637627 - config_name: unshuffled_deduplicated_ms features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 45064684 num_examples: 183443 download_size: 16391407 dataset_size: 45064684 - config_name: unshuffled_deduplicated_myv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1224 num_examples: 5 download_size: 705 dataset_size: 1224 - config_name: unshuffled_deduplicated_nds features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 13360483 num_examples: 8714 download_size: 5271194 dataset_size: 13360483 - config_name: unshuffled_deduplicated_nn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 57286159 num_examples: 109118 download_size: 23583774 dataset_size: 57286159 - config_name: unshuffled_deduplicated_os features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 10962689 num_examples: 2559 download_size: 2829131 dataset_size: 10962689 - config_name: unshuffled_deduplicated_pms features: - name: id dtype: int64 - 
name: text dtype: string splits: - name: train num_bytes: 1996853 num_examples: 2859 download_size: 716837 dataset_size: 1996853 - config_name: unshuffled_deduplicated_qu features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 72587 num_examples: 411 download_size: 17501 dataset_size: 72587 - config_name: unshuffled_deduplicated_sa features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 38236039 num_examples: 7121 download_size: 7268337 dataset_size: 38236039 - config_name: unshuffled_deduplicated_sk features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 4768416160 num_examples: 2820821 download_size: 1960409934 dataset_size: 4768416160 - config_name: unshuffled_deduplicated_sh features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 6184582 num_examples: 17610 download_size: 1445894 dataset_size: 6184582 - config_name: unshuffled_deduplicated_so features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 16269 num_examples: 42 download_size: 2109 dataset_size: 16269 - config_name: unshuffled_deduplicated_sr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2358255234 num_examples: 645747 download_size: 665025000 dataset_size: 2358255234 - config_name: unshuffled_deduplicated_ta features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 5477003981 num_examples: 833101 download_size: 971118176 dataset_size: 5477003981 - config_name: unshuffled_deduplicated_tk features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 7092199 num_examples: 4694 download_size: 2219582 dataset_size: 7092199 - config_name: unshuffled_deduplicated_tyv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 8319 num_examples: 24 download_size: 2976 
dataset_size: 8319 - config_name: unshuffled_deduplicated_uz features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 11834927 num_examples: 15074 download_size: 4300299 dataset_size: 11834927 - config_name: unshuffled_deduplicated_wa features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 214337 num_examples: 677 download_size: 79130 dataset_size: 214337 - config_name: unshuffled_deduplicated_xmf features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 4617445 num_examples: 2418 download_size: 943151 dataset_size: 4617445 - config_name: unshuffled_deduplicated_sv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 26239415574 num_examples: 11014487 download_size: 10185393483 dataset_size: 26239415574 - config_name: unshuffled_deduplicated_tg features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 261233997 num_examples: 56259 download_size: 62908723 dataset_size: 261233997 - config_name: unshuffled_deduplicated_de features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 155723559907 num_examples: 62398034 download_size: 60797849113 dataset_size: 155723559907 - config_name: unshuffled_deduplicated_tr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 28375018927 num_examples: 11596446 download_size: 10390754678 dataset_size: 28375018927 - config_name: unshuffled_deduplicated_el features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 28689398676 num_examples: 6521169 download_size: 7907952068 dataset_size: 28689398676 - config_name: unshuffled_deduplicated_uk features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 29791312367 num_examples: 7782375 download_size: 8037737457 dataset_size: 29791312367 - config_name: 
unshuffled_deduplicated_vi features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 33528331774 num_examples: 9897709 download_size: 10711506712 dataset_size: 33528331774 - config_name: unshuffled_deduplicated_wuu features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 33253 num_examples: 64 download_size: 7273 dataset_size: 33253 - config_name: unshuffled_deduplicated_yo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 27169 num_examples: 49 download_size: 8925 dataset_size: 27169 - config_name: unshuffled_original_als features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 5297910 num_examples: 7324 download_size: 1489734 dataset_size: 5297910 - config_name: unshuffled_original_arz features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 70132423 num_examples: 158113 download_size: 15891255 dataset_size: 70132423 - config_name: unshuffled_original_az features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2964781192 num_examples: 912330 download_size: 927763846 dataset_size: 2964781192 - config_name: unshuffled_original_bcl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 901 num_examples: 1 download_size: 581 dataset_size: 901 - config_name: unshuffled_original_bn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 10771945233 num_examples: 1675515 download_size: 2139944099 dataset_size: 10771945233 - config_name: unshuffled_original_bs features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 482740 num_examples: 2143 download_size: 56419 dataset_size: 482740 - config_name: unshuffled_original_ce features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 8735740 num_examples: 
4042 download_size: 2089184 dataset_size: 8735740 - config_name: unshuffled_original_cv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 41047029 num_examples: 20281 download_size: 9400068 dataset_size: 41047029 - config_name: unshuffled_original_diq features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 162 num_examples: 1 download_size: 318 dataset_size: 162 - config_name: unshuffled_original_eml features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 26099 num_examples: 84 download_size: 10071 dataset_size: 26099 - config_name: unshuffled_original_et features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 5174800705 num_examples: 2093621 download_size: 1881328631 dataset_size: 5174800705 - config_name: unshuffled_deduplicated_zh features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 267614324325 num_examples: 41708901 download_size: 99982781539 dataset_size: 267614324325 - config_name: unshuffled_original_an features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1329433 num_examples: 2449 download_size: 148184 dataset_size: 1329433 - config_name: unshuffled_original_ast features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2539238 num_examples: 6999 download_size: 920730 dataset_size: 2539238 - config_name: unshuffled_original_ba features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 133704014 num_examples: 42551 download_size: 33215002 dataset_size: 133704014 - config_name: unshuffled_original_bg features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 33753811450 num_examples: 5869686 download_size: 8336964541 dataset_size: 33753811450 - config_name: unshuffled_original_bpy features: - name: id dtype: int64 - 
name: text dtype: string splits: - name: train num_bytes: 4347467 num_examples: 6046 download_size: 336974 dataset_size: 4347467 - config_name: unshuffled_original_ca features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 8623251470 num_examples: 4390754 download_size: 3101954304 dataset_size: 8623251470 - config_name: unshuffled_original_ckb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 510965919 num_examples: 103639 download_size: 111884006 dataset_size: 510965919 - config_name: unshuffled_deduplicated_es features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 160418075023 num_examples: 56326016 download_size: 60464970319 dataset_size: 160418075023 - config_name: unshuffled_original_da features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 16756455589 num_examples: 7664010 download_size: 6000579388 dataset_size: 16756455589 - config_name: unshuffled_original_dv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 131628992 num_examples: 21018 download_size: 24914404 dataset_size: 131628992 - config_name: unshuffled_original_eo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 314188336 num_examples: 121168 download_size: 117076019 dataset_size: 314188336 - config_name: unshuffled_deduplicated_fi features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 13945067515 num_examples: 5326443 download_size: 5380047103 dataset_size: 13945067515 - config_name: unshuffled_deduplicated_ga features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 63370688 num_examples: 46493 download_size: 22218633 dataset_size: 63370688 - config_name: unshuffled_deduplicated_gom features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 
1863089 num_examples: 484 download_size: 377051 dataset_size: 1863089 - config_name: unshuffled_deduplicated_hr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 118047678 num_examples: 321484 download_size: 46731365 dataset_size: 118047678 - config_name: unshuffled_deduplicated_hy features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1559114836 num_examples: 396093 download_size: 393620208 dataset_size: 1559114836 - config_name: unshuffled_deduplicated_ilo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 667896 num_examples: 1578 download_size: 230065 dataset_size: 667896 - config_name: unshuffled_original_fa features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 84209448803 num_examples: 13704702 download_size: 20956409096 dataset_size: 84209448803 - config_name: unshuffled_original_fy features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 36238452 num_examples: 33053 download_size: 12409774 dataset_size: 36238452 - config_name: unshuffled_original_gn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 37427 num_examples: 106 download_size: 9761 dataset_size: 37427 - config_name: unshuffled_original_hi features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 17929286362 num_examples: 3264660 download_size: 3656636848 dataset_size: 17929286362 - config_name: unshuffled_original_hu features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 43074893842 num_examples: 11197780 download_size: 15693847091 dataset_size: 43074893842 - config_name: unshuffled_original_ie features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 25355 num_examples: 101 download_size: 783 dataset_size: 25355 - config_name: 
unshuffled_deduplicated_ja features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 113315056833 num_examples: 39496439 download_size: 40801218295 dataset_size: 113315056833 - config_name: unshuffled_deduplicated_kk features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1583064520 num_examples: 338073 download_size: 389111715 dataset_size: 1583064520 - config_name: unshuffled_deduplicated_krc features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2412731 num_examples: 1377 download_size: 615982 dataset_size: 2412731 - config_name: unshuffled_deduplicated_ky features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 407576051 num_examples: 86561 download_size: 106219565 dataset_size: 407576051 - config_name: unshuffled_deduplicated_li features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 28176 num_examples: 118 download_size: 11724 dataset_size: 28176 - config_name: unshuffled_deduplicated_lt features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 4185372402 num_examples: 1737411 download_size: 1653025558 dataset_size: 4185372402 - config_name: unshuffled_deduplicated_mhr features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 6247177 num_examples: 2515 download_size: 1622076 dataset_size: 6247177 - config_name: unshuffled_deduplicated_mn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 880883961 num_examples: 197878 download_size: 219516471 dataset_size: 880883961 - config_name: unshuffled_deduplicated_mt features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 17539926 num_examples: 16383 download_size: 5898934 dataset_size: 17539926 - config_name: unshuffled_deduplicated_mzn features: - name: id dtype: int64 - name: text 
    dtype: string
  splits:
  - name: train
    num_bytes: 626534
    num_examples: 917
  download_size: 157541
  dataset_size: 626534
- config_name: unshuffled_deduplicated_ne
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1239170286
    num_examples: 219334
  download_size: 240627361
  dataset_size: 1239170286
- config_name: unshuffled_deduplicated_no
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5077919278
    num_examples: 3229940
  download_size: 1960828800
  dataset_size: 5077919278
- config_name: unshuffled_deduplicated_pa
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 482461302
    num_examples: 87235
  download_size: 102390579
  dataset_size: 482461302
- config_name: unshuffled_deduplicated_pnb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 9416915
    num_examples: 3463
  download_size: 2579976
  dataset_size: 9416915
- config_name: unshuffled_deduplicated_rm
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6932
    num_examples: 34
  download_size: 2679
  dataset_size: 6932
- config_name: unshuffled_deduplicated_sah
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 27293316
    num_examples: 8555
  download_size: 7020207
  dataset_size: 27293316
- config_name: unshuffled_deduplicated_si
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 841460012
    num_examples: 120684
  download_size: 175610997
  dataset_size: 841460012
- config_name: unshuffled_deduplicated_sq
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1208425681
    num_examples: 461598
  download_size: 445358539
  dataset_size: 1208425681
- config_name: unshuffled_deduplicated_sw
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8747758
    num_examples: 24803
  download_size: 2946034
  dataset_size: 8747758
- config_name: unshuffled_deduplicated_th
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 17082022564
    num_examples: 3749826
  download_size: 3536468931
  dataset_size: 17082022564
- config_name: unshuffled_deduplicated_tt
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 320641922
    num_examples: 82738
  download_size: 85893621
  dataset_size: 320641922
- config_name: unshuffled_deduplicated_ur
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1819253063
    num_examples: 428674
  download_size: 483593818
  dataset_size: 1819253063
- config_name: unshuffled_deduplicated_vo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2098461
    num_examples: 3317
  download_size: 301687
  dataset_size: 2098461
- config_name: unshuffled_deduplicated_xal
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 114574
    num_examples: 36
  download_size: 31863
  dataset_size: 114574
- config_name: unshuffled_deduplicated_yue
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2267
    num_examples: 7
  download_size: 646
  dataset_size: 2267
- config_name: unshuffled_original_am
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 378060369
    num_examples: 83663
  download_size: 102789518
  dataset_size: 378060369
- config_name: unshuffled_original_as
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 117733678
    num_examples: 14985
  download_size: 21437245
  dataset_size: 117733678
- config_name: unshuffled_original_azb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 28469069
    num_examples: 15446
  download_size: 6641415
  dataset_size: 28469069
- config_name: unshuffled_original_be
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1877972506
    num_examples: 586031
  download_size: 498295673
  dataset_size: 1877972506
- config_name: unshuffled_original_bo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 195400209
    num_examples: 26795
  download_size: 28940995
  dataset_size: 195400209
- config_name: unshuffled_original_bxr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 13376
    num_examples: 42
  download_size: 3688
  dataset_size: 13376
- config_name: unshuffled_original_ceb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 40964537
    num_examples: 56248
  download_size: 11070392
  dataset_size: 40964537
- config_name: unshuffled_original_cy
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 224933804
    num_examples: 157698
  download_size: 81736037
  dataset_size: 224933804
- config_name: unshuffled_original_dsb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 13761
    num_examples: 65
  download_size: 3753
  dataset_size: 13761
- config_name: unshuffled_original_fr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 303190338653
    num_examples: 96742378
  download_size: 105324330228
  dataset_size: 303190338653
- config_name: unshuffled_original_gd
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2022000
    num_examples: 5799
  download_size: 525253
  dataset_size: 2022000
- config_name: unshuffled_original_gu
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1094814909
    num_examples: 240691
  download_size: 232021129
  dataset_size: 1094814909
- config_name: unshuffled_original_hsb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4482886
    num_examples: 7959
  download_size: 1389826
  dataset_size: 4482886
- config_name: unshuffled_original_ia
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 689455
    num_examples: 1040
  download_size: 83325
  dataset_size: 689455
- config_name: unshuffled_original_io
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 158808
    num_examples: 694
  download_size: 44548
  dataset_size: 158808
- config_name: unshuffled_original_jbo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 763027
    num_examples: 832
  download_size: 212962
  dataset_size: 763027
- config_name: unshuffled_original_km
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1102616385
    num_examples: 159363
  download_size: 193286621
  dataset_size: 1102616385
- config_name: unshuffled_original_ku
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 99062676
    num_examples: 46535
  download_size: 33376537
  dataset_size: 99062676
- config_name: unshuffled_original_la
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 27801400
    num_examples: 94588
  download_size: 5458131
  dataset_size: 27801400
- config_name: unshuffled_original_lmo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 470001
    num_examples: 1401
  download_size: 109759
  dataset_size: 470001
- config_name: unshuffled_original_lv
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4266812625
    num_examples: 1593820
  download_size: 1486675302
  dataset_size: 4266812625
- config_name: unshuffled_original_min
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 624991
    num_examples: 220
  download_size: 12379
  dataset_size: 624991
- config_name: unshuffled_original_mr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2816455519
    num_examples: 326804
  download_size: 525303459
  dataset_size: 2816455519
- config_name: unshuffled_original_mwl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1273
    num_examples: 8
  download_size: 789
  dataset_size: 1273
- config_name: unshuffled_original_nah
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12070
    num_examples: 61
  download_size: 2857
  dataset_size: 12070
- config_name: unshuffled_original_new
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5766053
    num_examples: 4696
  download_size: 1031042
  dataset_size: 5766053
- config_name: unshuffled_original_oc
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6127539
    num_examples: 10709
  download_size: 1574956
  dataset_size: 6127539
- config_name: unshuffled_original_pam
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 800
    num_examples: 3
  download_size: 364
  dataset_size: 800
- config_name: unshuffled_original_ps
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 379515973
    num_examples: 98216
  download_size: 103659691
  dataset_size: 379515973
- config_name: unshuffled_original_ro
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 26869251055
    num_examples: 9387265
  download_size: 9534521905
  dataset_size: 26869251055
- config_name: unshuffled_original_scn
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3573
    num_examples: 21
  download_size: 1614
  dataset_size: 3573
- config_name: unshuffled_original_sk
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 9808179461
    num_examples: 5492194
  download_size: 3708313186
  dataset_size: 9808179461
- config_name: unshuffled_original_sr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4131922671
    num_examples: 1013619
  download_size: 1081129678
  dataset_size: 4131922671
- config_name: unshuffled_original_ta
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 9933590150
    num_examples: 1263280
  download_size: 1737252172
  dataset_size: 9933590150
- config_name: unshuffled_original_tk
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 10662991
    num_examples: 6456
  download_size: 2956150
  dataset_size: 10662991
- config_name: unshuffled_original_tyv
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12219
    num_examples: 34
  download_size: 3034
  dataset_size: 12219
- config_name: unshuffled_original_uz
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21464779
    num_examples: 27537
  download_size: 5775644
  dataset_size: 21464779
- config_name: unshuffled_original_wa
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 291400
    num_examples: 1001
  download_size: 89942
  dataset_size: 291400
- config_name: unshuffled_original_xmf
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6120123
    num_examples: 3783
  download_size: 1048265
  dataset_size: 6120123
- config_name: unshuffled_original_it
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 147378116499
    num_examples: 46981781
  download_size: 52157691650
  dataset_size: 147378116499
- config_name: unshuffled_original_ka
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3768832240
    num_examples: 563916
  download_size: 680732710
  dataset_size: 3768832240
- config_name: unshuffled_original_ko
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 25292102197
    num_examples: 7345075
  download_size: 8807937093
  dataset_size: 25292102197
- config_name: unshuffled_original_kw
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 47016
    num_examples: 203
  download_size: 6715
  dataset_size: 47016
- config_name: unshuffled_original_lez
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3378104
    num_examples: 1485
  download_size: 825648
  dataset_size: 3378104
- config_name: unshuffled_original_lrc
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 78347
    num_examples: 88
  download_size: 16573
  dataset_size: 78347
- config_name: unshuffled_original_mg
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21789998
    num_examples: 17957
  download_size: 6213316
  dataset_size: 21789998
- config_name: unshuffled_original_ml
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5244279375
    num_examples: 603937
  download_size: 938681749
  dataset_size: 5244279375
- config_name: unshuffled_original_ms
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 122326270
    num_examples: 534016
  download_size: 28458804
  dataset_size: 122326270
- config_name: unshuffled_original_myv
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1436
    num_examples: 6
  download_size: 691
  dataset_size: 1436
- config_name: unshuffled_original_nds
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 18238189
    num_examples: 18174
  download_size: 6744705
  dataset_size: 18238189
- config_name: unshuffled_original_nn
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 90838777
    num_examples: 185884
  download_size: 32863375
  dataset_size: 90838777
- config_name: unshuffled_original_os
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12893477
    num_examples: 5213
  download_size: 3096133
  dataset_size: 12893477
- config_name: unshuffled_original_pms
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2154710
    num_examples: 3225
  download_size: 756400
  dataset_size: 2154710
- config_name: unshuffled_original_qu
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 85032
    num_examples: 452
  download_size: 17931
  dataset_size: 85032
- config_name: unshuffled_original_sa
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 97055224
    num_examples: 14291
  download_size: 17517475
  dataset_size: 97055224
- config_name: unshuffled_original_sh
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 25841505
    num_examples: 36700
  download_size: 3457359
  dataset_size: 25841505
- config_name: unshuffled_original_so
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 63785
    num_examples: 156
  download_size: 2478
  dataset_size: 63785
- config_name: unshuffled_original_sv
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 47000933560
    num_examples: 17395625
  download_size: 17182697021
  dataset_size: 47000933560
- config_name: unshuffled_original_tg
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 397436494
    num_examples: 89002
  download_size: 90972727
  dataset_size: 397436494
- config_name: unshuffled_original_tr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 63581153419
    num_examples: 18535253
  download_size: 21961561999
  dataset_size: 63581153419
- config_name: unshuffled_original_uk
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 56439494556
    num_examples: 12973467
  download_size: 14419203733
  dataset_size: 56439494556
- config_name: unshuffled_original_vi
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 72226388484
    num_examples: 14898250
  download_size: 21503594095
  dataset_size: 72226388484
- config_name: unshuffled_original_wuu
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 114041
    num_examples: 214
  download_size: 8780
  dataset_size: 114041
- config_name: unshuffled_original_yo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 58546
    num_examples: 214
  download_size: 9550
  dataset_size: 58546
- config_name: unshuffled_original_zh
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 545607539477
    num_examples: 60137667
  download_size: 206003993405
  dataset_size: 545607539477
- config_name: unshuffled_deduplicated_en
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1297616499791
    num_examples: 304230423
  download_size: 496496144465
  dataset_size: 1297616499791
- config_name: unshuffled_deduplicated_eu
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 360674267
    num_examples: 256513
  download_size: 134683484
  dataset_size: 360674267
- config_name: unshuffled_deduplicated_frr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4500
    num_examples: 7
  download_size: 540
  dataset_size: 4500
- config_name: unshuffled_deduplicated_gl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 404922022
    num_examples: 284320
  download_size: 155851883
  dataset_size: 404922022
- config_name: unshuffled_deduplicated_he
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 10451408409
    num_examples: 2375030
  download_size: 3043383695
  dataset_size: 10451408409
- config_name: unshuffled_deduplicated_ht
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3439
    num_examples: 9
  download_size: 594
  dataset_size: 3439
- config_name: unshuffled_deduplicated_id
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 16964948727
    num_examples: 9948521
  download_size: 5995510660
  dataset_size: 16964948727
- config_name: unshuffled_deduplicated_is
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 891047926
    num_examples: 389515
  download_size: 332871764
  dataset_size: 891047926
- config_name: unshuffled_deduplicated_jv
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 609713
    num_examples: 1163
  download_size: 208165
  dataset_size: 609713
- config_name: unshuffled_deduplicated_kn
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1080985653
    num_examples: 251064
  download_size: 215526836
  dataset_size: 1080985653
- config_name: unshuffled_deduplicated_kv
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1200609
    num_examples: 924
  download_size: 327479
  dataset_size: 1200609
- config_name: unshuffled_deduplicated_lb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21242773
    num_examples: 21735
  download_size: 8300328
  dataset_size: 21242773
- config_name: unshuffled_deduplicated_lo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 119015146
    num_examples: 32652
  download_size: 23634237
  dataset_size: 119015146
- config_name: unshuffled_deduplicated_mai
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 10721
    num_examples: 25
  download_size: 2267
  dataset_size: 10721
- config_name: unshuffled_deduplicated_mk
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1186605123
    num_examples: 299457
  download_size: 303118518
  dataset_size: 1186605123
- config_name: unshuffled_deduplicated_mrj
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1096428
    num_examples: 669
  download_size: 289048
  dataset_size: 1096428
- config_name: unshuffled_deduplicated_my
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1112006614
    num_examples: 136639
  download_size: 207136614
  dataset_size: 1112006614
- config_name: unshuffled_deduplicated_nap
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 13782
    num_examples: 55
  download_size: 4965
  dataset_size: 13782
- config_name: unshuffled_deduplicated_nl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 41726089054
    num_examples: 20812149
  download_size: 15734167112
  dataset_size: 41726089054
- config_name: unshuffled_deduplicated_or
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 197401878
    num_examples: 44230
  download_size: 38726721
  dataset_size: 197401878
- config_name: unshuffled_deduplicated_pl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 50387595763
    num_examples: 20682611
  download_size: 20189161328
  dataset_size: 50387595763
- config_name: unshuffled_deduplicated_pt
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 68162434231
    num_examples: 26920397
  download_size: 25997795946
  dataset_size: 68162434231
- config_name: unshuffled_deduplicated_ru
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 611031071327
    num_examples: 115954598
  download_size: 166677136024
  dataset_size: 611031071327
- config_name: unshuffled_deduplicated_sd
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 275327037
    num_examples: 33925
  download_size: 74169753
  dataset_size: 275327037
- config_name: unshuffled_deduplicated_sl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1311219223
    num_examples: 886223
  download_size: 523218283
  dataset_size: 1311219223
- config_name: unshuffled_deduplicated_su
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 149921
    num_examples: 511
  download_size: 53164
  dataset_size: 149921
- config_name: unshuffled_deduplicated_te
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1694004428
    num_examples: 312644
  download_size: 342429224
  dataset_size: 1694004428
- config_name: unshuffled_deduplicated_tl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 429427446
    num_examples: 294132
  download_size: 151342433
  dataset_size: 429427446
- config_name: unshuffled_deduplicated_ug
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 86344782
    num_examples: 15503
  download_size: 20527752
  dataset_size: 86344782
- config_name: unshuffled_deduplicated_vec
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 17303
    num_examples: 64
  download_size: 7647
  dataset_size: 17303
- config_name: unshuffled_deduplicated_war
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2338532
    num_examples: 9161
  download_size: 546586
  dataset_size: 2338532
- config_name: unshuffled_deduplicated_yi
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 87935052
    num_examples: 32919
  download_size: 22197718
  dataset_size: 87935052
- config_name: unshuffled_original_af
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 254076274
    num_examples: 201117
  download_size: 85795254
  dataset_size: 254076274
- config_name: unshuffled_original_ar
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 87935768938
    num_examples: 16365602
  download_size: 22232546836
  dataset_size: 87935768938
- config_name: unshuffled_original_av
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 423603
    num_examples: 456
  download_size: 84767
  dataset_size: 423603
- config_name: unshuffled_original_bar
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 555
    num_examples: 4
  download_size: 341
  dataset_size: 555
- config_name: unshuffled_original_bh
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 116514
    num_examples: 336
  download_size: 7615
  dataset_size: 116514
- config_name: unshuffled_original_br
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 30203875
    num_examples: 37085
  download_size: 9178158
  dataset_size: 30203875
- config_name: unshuffled_original_cbk
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 536
    num_examples: 1
  download_size: 234
  dataset_size: 536
- config_name: unshuffled_original_cs
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 57080142860
    num_examples: 21001388
  download_size: 21716697253
  dataset_size: 57080142860
- config_name: unshuffled_original_de
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 331224484023
    num_examples: 104913504
  download_size: 119506267566
  dataset_size: 331224484023
- config_name: unshuffled_original_el
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 66273231642
    num_examples: 10425596
  download_size: 17309601342
  dataset_size: 66273231642
- config_name: unshuffled_original_es
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 298492270636
    num_examples: 88199221
  download_size: 106039137656
  dataset_size: 298492270636
- config_name: unshuffled_original_fi
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 28571419204
    num_examples: 8557453
  download_size: 9970837279
  dataset_size: 28571419204
- config_name: unshuffled_original_ga
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 92369035
    num_examples: 83223
  download_size: 29262282
  dataset_size: 92369035
- config_name: unshuffled_original_gom
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2257169
    num_examples: 640
  download_size: 442950
  dataset_size: 2257169
- config_name: unshuffled_original_hr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 243829069
    num_examples: 582219
  download_size: 79417804
  dataset_size: 243829069
- config_name: unshuffled_original_hy
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3939672772
    num_examples: 659430
  download_size: 897364024
  dataset_size: 3939672772
- config_name: unshuffled_original_ilo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 925809
    num_examples: 2638
  download_size: 267451
  dataset_size: 925809
- config_name: unshuffled_original_ja
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 232216718556
    num_examples: 62721527
  download_size: 79564645083
  dataset_size: 232216718556
- config_name: unshuffled_original_kk
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2833778199
    num_examples: 524591
  download_size: 615067761
  dataset_size: 2833778199
- config_name: unshuffled_original_krc
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2688672
    num_examples: 1581
  download_size: 656496
  dataset_size: 2688672
- config_name: unshuffled_original_ky
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 630794622
    num_examples: 146993
  download_size: 152636608
  dataset_size: 630794622
- config_name: unshuffled_original_li
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 31312
    num_examples: 137
  download_size: 11793
  dataset_size: 31312
- config_name: unshuffled_original_lt
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 9445278312
    num_examples: 2977757
  download_size: 3439789726
  dataset_size: 9445278312
- config_name: unshuffled_original_mhr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 7553453
    num_examples: 3212
  download_size: 1834912
  dataset_size: 7553453
- config_name: unshuffled_original_mn
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2332897881
    num_examples: 395605
  download_size: 472357548
  dataset_size: 2332897881
- config_name: unshuffled_original_mt
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 24470330
    num_examples: 26598
  download_size: 7533204
  dataset_size: 24470330
- config_name: unshuffled_original_mzn
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 720229
    num_examples: 1055
  download_size: 177817
  dataset_size: 720229
- config_name: unshuffled_original_ne
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1866852959
    num_examples: 299938
  download_size: 355291639
  dataset_size: 1866852959
- config_name: unshuffled_original_no
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8652054976
    num_examples: 5546211
  download_size: 3106155643
  dataset_size: 8652054976
- config_name: unshuffled_original_pa
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 801167879
    num_examples: 127467
  download_size: 164207256
  dataset_size: 801167879
- config_name: unshuffled_original_pnb
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 12039418
    num_examples: 4599
  download_size: 3215579
  dataset_size: 12039418
- config_name: unshuffled_original_rm
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8027
    num_examples: 41
  download_size: 2691
  dataset_size: 8027
- config_name: unshuffled_original_sah
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 43817239
    num_examples: 22301
  download_size: 9079982
  dataset_size: 43817239
- config_name: unshuffled_original_si
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1469374795
    num_examples: 203082
  download_size: 310935021
  dataset_size: 1469374795
- config_name: unshuffled_original_sq
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2440834375
    num_examples: 672077
  download_size: 861831806
  dataset_size: 2440834375
- config_name: unshuffled_original_sw
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 14073775
    num_examples: 41986
  download_size: 3712739
  dataset_size: 14073775
- config_name: unshuffled_original_th
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 38289228753
    num_examples: 6064129
  download_size: 7377469078
  dataset_size: 38289228753
- config_name: unshuffled_original_tt
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 703412782
    num_examples: 135923
  download_size: 151056507
  dataset_size: 703412782
- config_name: unshuffled_original_ur
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2802270961
    num_examples: 638596
  download_size: 712607161
  dataset_size: 2802270961
- config_name: unshuffled_original_vo
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2118909
    num_examples: 3366
  download_size: 307184
  dataset_size: 2118909
- config_name: unshuffled_original_xal
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 116043
    num_examples: 39
  download_size: 32117
  dataset_size: 116043
- config_name: unshuffled_original_yue
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3899
    num_examples: 11
  download_size: 647
  dataset_size: 3899
- config_name: unshuffled_original_en
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2525437912097
    num_examples: 455994980
  download_size: 903830686146
  dataset_size: 2525437912097
- config_name: unshuffled_original_eu
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 894836188
    num_examples: 506883
  download_size: 248190119
  dataset_size: 894836188
- config_name: unshuffled_original_frr
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4507
    num_examples: 7
  download_size: 527
  dataset_size: 4507
- config_name: unshuffled_original_gl
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 656477422
    num_examples: 544388
  download_size: 235384299
  dataset_size: 656477422
- config_name: unshuffled_original_he
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21113706929
    num_examples: 3808397
  download_size: 5660026441
  dataset_size: 21113706929
- config_name: unshuffled_original_ht
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4083
    num_examples: 13
  download_size: 590
  dataset_size: 4083
- config_name: unshuffled_original_id
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32317679452
    num_examples: 16236463
  download_size: 10596988488
  dataset_size: 32317679452
- config_name: unshuffled_original_is
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  splits:
- name: train num_bytes: 1524936467 num_examples: 625673 download_size: 533034495 dataset_size: 1524936467 - config_name: unshuffled_original_jv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 691812 num_examples: 1445 download_size: 219246 dataset_size: 691812 - config_name: unshuffled_original_kn features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1763625096 num_examples: 350363 download_size: 342155433 dataset_size: 1763625096 - config_name: unshuffled_original_kv features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2379758 num_examples: 1549 download_size: 400725 dataset_size: 2379758 - config_name: unshuffled_original_lb features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 30595156 num_examples: 34807 download_size: 10725552 dataset_size: 30595156 - config_name: unshuffled_original_lo features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 182361509 num_examples: 52910 download_size: 33916738 dataset_size: 182361509 - config_name: unshuffled_original_mai features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 325990 num_examples: 123 download_size: 5563 dataset_size: 325990 - config_name: unshuffled_original_mk features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2202480390 num_examples: 437871 download_size: 508239918 dataset_size: 2202480390 - config_name: unshuffled_original_mrj features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1165977 num_examples: 757 download_size: 303447 dataset_size: 1165977 - config_name: unshuffled_original_my features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2021872493 num_examples: 232329 download_size: 369850157 dataset_size: 2021872493 - config_name: 
unshuffled_original_nap features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 17839 num_examples: 73 download_size: 5023 dataset_size: 17839 - config_name: unshuffled_original_nl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 83230965323 num_examples: 34682142 download_size: 29352811750 dataset_size: 83230965323 - config_name: unshuffled_original_or features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 260151226 num_examples: 59463 download_size: 49834443 dataset_size: 260151226 - config_name: unshuffled_original_pl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 117121370605 num_examples: 35440972 download_size: 42884898947 dataset_size: 117121370605 - config_name: unshuffled_original_pt features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 132635490139 num_examples: 42114520 download_size: 47257949300 dataset_size: 132635490139 - config_name: unshuffled_original_ru features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 1241627166551 num_examples: 161836003 download_size: 319755378587 dataset_size: 1241627166551 - config_name: unshuffled_original_sd features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 364256869 num_examples: 44280 download_size: 90621520 dataset_size: 364256869 - config_name: unshuffled_original_sl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2675665926 num_examples: 1746604 download_size: 956197026 dataset_size: 2675665926 - config_name: unshuffled_original_su features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 225627 num_examples: 805 download_size: 59643 dataset_size: 225627 - config_name: unshuffled_original_te features: - name: id dtype: int64 - name: text dtype: 
string splits: - name: train num_bytes: 2611548765 num_examples: 475703 download_size: 522470115 dataset_size: 2611548765 - config_name: unshuffled_original_tl features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 606295665 num_examples: 458206 download_size: 204895159 dataset_size: 606295665 - config_name: unshuffled_original_ug features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 127419368 num_examples: 22255 download_size: 27923925 dataset_size: 127419368 - config_name: unshuffled_original_vec features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 19182 num_examples: 73 download_size: 7672 dataset_size: 19182 - config_name: unshuffled_original_war features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 2682430 num_examples: 9760 download_size: 644576 dataset_size: 2682430 - config_name: unshuffled_original_yi features: - name: id dtype: int64 - name: text dtype: string splits: - name: train num_bytes: 147601654 num_examples: 59364 download_size: 33337157 dataset_size: 147601654 config_names: - unshuffled_deduplicated_af - unshuffled_deduplicated_als - unshuffled_deduplicated_am - unshuffled_deduplicated_an - unshuffled_deduplicated_ar - unshuffled_deduplicated_arz - unshuffled_deduplicated_as - unshuffled_deduplicated_ast - unshuffled_deduplicated_av - unshuffled_deduplicated_az - unshuffled_deduplicated_azb - unshuffled_deduplicated_ba - unshuffled_deduplicated_bar - unshuffled_deduplicated_bcl - unshuffled_deduplicated_be - unshuffled_deduplicated_bg - unshuffled_deduplicated_bh - unshuffled_deduplicated_bn - unshuffled_deduplicated_bo - unshuffled_deduplicated_bpy - unshuffled_deduplicated_br - unshuffled_deduplicated_bs - unshuffled_deduplicated_bxr - unshuffled_deduplicated_ca - unshuffled_deduplicated_cbk - unshuffled_deduplicated_ce - unshuffled_deduplicated_ceb - unshuffled_deduplicated_ckb - 
unshuffled_deduplicated_cs - unshuffled_deduplicated_cv - unshuffled_deduplicated_cy - unshuffled_deduplicated_da - unshuffled_deduplicated_de - unshuffled_deduplicated_diq - unshuffled_deduplicated_dsb - unshuffled_deduplicated_dv - unshuffled_deduplicated_el - unshuffled_deduplicated_eml - unshuffled_deduplicated_en - unshuffled_deduplicated_eo - unshuffled_deduplicated_es - unshuffled_deduplicated_et - unshuffled_deduplicated_eu - unshuffled_deduplicated_fa - unshuffled_deduplicated_fi - unshuffled_deduplicated_fr - unshuffled_deduplicated_frr - unshuffled_deduplicated_fy - unshuffled_deduplicated_ga - unshuffled_deduplicated_gd - unshuffled_deduplicated_gl - unshuffled_deduplicated_gn - unshuffled_deduplicated_gom - unshuffled_deduplicated_gu - unshuffled_deduplicated_he - unshuffled_deduplicated_hi - unshuffled_deduplicated_hr - unshuffled_deduplicated_hsb - unshuffled_deduplicated_ht - unshuffled_deduplicated_hu - unshuffled_deduplicated_hy - unshuffled_deduplicated_ia - unshuffled_deduplicated_id - unshuffled_deduplicated_ie - unshuffled_deduplicated_ilo - unshuffled_deduplicated_io - unshuffled_deduplicated_is - unshuffled_deduplicated_it - unshuffled_deduplicated_ja - unshuffled_deduplicated_jbo - unshuffled_deduplicated_jv - unshuffled_deduplicated_ka - unshuffled_deduplicated_kk - unshuffled_deduplicated_km - unshuffled_deduplicated_kn - unshuffled_deduplicated_ko - unshuffled_deduplicated_krc - unshuffled_deduplicated_ku - unshuffled_deduplicated_kv - unshuffled_deduplicated_kw - unshuffled_deduplicated_ky - unshuffled_deduplicated_la - unshuffled_deduplicated_lb - unshuffled_deduplicated_lez - unshuffled_deduplicated_li - unshuffled_deduplicated_lmo - unshuffled_deduplicated_lo - unshuffled_deduplicated_lrc - unshuffled_deduplicated_lt - unshuffled_deduplicated_lv - unshuffled_deduplicated_mai - unshuffled_deduplicated_mg - unshuffled_deduplicated_mhr - unshuffled_deduplicated_min - unshuffled_deduplicated_mk - unshuffled_deduplicated_ml - 
unshuffled_deduplicated_mn - unshuffled_deduplicated_mr - unshuffled_deduplicated_mrj - unshuffled_deduplicated_ms - unshuffled_deduplicated_mt - unshuffled_deduplicated_mwl - unshuffled_deduplicated_my - unshuffled_deduplicated_myv - unshuffled_deduplicated_mzn - unshuffled_deduplicated_nah - unshuffled_deduplicated_nap - unshuffled_deduplicated_nds - unshuffled_deduplicated_ne - unshuffled_deduplicated_new - unshuffled_deduplicated_nl - unshuffled_deduplicated_nn - unshuffled_deduplicated_no - unshuffled_deduplicated_oc - unshuffled_deduplicated_or - unshuffled_deduplicated_os - unshuffled_deduplicated_pa - unshuffled_deduplicated_pam - unshuffled_deduplicated_pl - unshuffled_deduplicated_pms - unshuffled_deduplicated_pnb - unshuffled_deduplicated_ps - unshuffled_deduplicated_pt - unshuffled_deduplicated_qu - unshuffled_deduplicated_rm - unshuffled_deduplicated_ro - unshuffled_deduplicated_ru - unshuffled_deduplicated_sa - unshuffled_deduplicated_sah - unshuffled_deduplicated_scn - unshuffled_deduplicated_sd - unshuffled_deduplicated_sh - unshuffled_deduplicated_si - unshuffled_deduplicated_sk - unshuffled_deduplicated_sl - unshuffled_deduplicated_so - unshuffled_deduplicated_sq - unshuffled_deduplicated_sr - unshuffled_deduplicated_su - unshuffled_deduplicated_sv - unshuffled_deduplicated_sw - unshuffled_deduplicated_ta - unshuffled_deduplicated_te - unshuffled_deduplicated_tg - unshuffled_deduplicated_th - unshuffled_deduplicated_tk - unshuffled_deduplicated_tl - unshuffled_deduplicated_tr - unshuffled_deduplicated_tt - unshuffled_deduplicated_tyv - unshuffled_deduplicated_ug - unshuffled_deduplicated_uk - unshuffled_deduplicated_ur - unshuffled_deduplicated_uz - unshuffled_deduplicated_vec - unshuffled_deduplicated_vi - unshuffled_deduplicated_vo - unshuffled_deduplicated_wa - unshuffled_deduplicated_war - unshuffled_deduplicated_wuu - unshuffled_deduplicated_xal - unshuffled_deduplicated_xmf - unshuffled_deduplicated_yi - unshuffled_deduplicated_yo - 
unshuffled_deduplicated_yue - unshuffled_deduplicated_zh - unshuffled_original_af - unshuffled_original_als - unshuffled_original_am - unshuffled_original_an - unshuffled_original_ar - unshuffled_original_arz - unshuffled_original_as - unshuffled_original_ast - unshuffled_original_av - unshuffled_original_az - unshuffled_original_azb - unshuffled_original_ba - unshuffled_original_bar - unshuffled_original_bcl - unshuffled_original_be - unshuffled_original_bg - unshuffled_original_bh - unshuffled_original_bn - unshuffled_original_bo - unshuffled_original_bpy - unshuffled_original_br - unshuffled_original_bs - unshuffled_original_bxr - unshuffled_original_ca - unshuffled_original_cbk - unshuffled_original_ce - unshuffled_original_ceb - unshuffled_original_ckb - unshuffled_original_cs - unshuffled_original_cv - unshuffled_original_cy - unshuffled_original_da - unshuffled_original_de - unshuffled_original_diq - unshuffled_original_dsb - unshuffled_original_dv - unshuffled_original_el - unshuffled_original_eml - unshuffled_original_en - unshuffled_original_eo - unshuffled_original_es - unshuffled_original_et - unshuffled_original_eu - unshuffled_original_fa - unshuffled_original_fi - unshuffled_original_fr - unshuffled_original_frr - unshuffled_original_fy - unshuffled_original_ga - unshuffled_original_gd - unshuffled_original_gl - unshuffled_original_gn - unshuffled_original_gom - unshuffled_original_gu - unshuffled_original_he - unshuffled_original_hi - unshuffled_original_hr - unshuffled_original_hsb - unshuffled_original_ht - unshuffled_original_hu - unshuffled_original_hy - unshuffled_original_ia - unshuffled_original_id - unshuffled_original_ie - unshuffled_original_ilo - unshuffled_original_io - unshuffled_original_is - unshuffled_original_it - unshuffled_original_ja - unshuffled_original_jbo - unshuffled_original_jv - unshuffled_original_ka - unshuffled_original_kk - unshuffled_original_km - unshuffled_original_kn - unshuffled_original_ko - 
unshuffled_original_krc - unshuffled_original_ku - unshuffled_original_kv - unshuffled_original_kw - unshuffled_original_ky - unshuffled_original_la - unshuffled_original_lb - unshuffled_original_lez - unshuffled_original_li - unshuffled_original_lmo - unshuffled_original_lo - unshuffled_original_lrc - unshuffled_original_lt - unshuffled_original_lv - unshuffled_original_mai - unshuffled_original_mg - unshuffled_original_mhr - unshuffled_original_min - unshuffled_original_mk - unshuffled_original_ml - unshuffled_original_mn - unshuffled_original_mr - unshuffled_original_mrj - unshuffled_original_ms - unshuffled_original_mt - unshuffled_original_mwl - unshuffled_original_my - unshuffled_original_myv - unshuffled_original_mzn - unshuffled_original_nah - unshuffled_original_nap - unshuffled_original_nds - unshuffled_original_ne - unshuffled_original_new - unshuffled_original_nl - unshuffled_original_nn - unshuffled_original_no - unshuffled_original_oc - unshuffled_original_or - unshuffled_original_os - unshuffled_original_pa - unshuffled_original_pam - unshuffled_original_pl - unshuffled_original_pms - unshuffled_original_pnb - unshuffled_original_ps - unshuffled_original_pt - unshuffled_original_qu - unshuffled_original_rm - unshuffled_original_ro - unshuffled_original_ru - unshuffled_original_sa - unshuffled_original_sah - unshuffled_original_scn - unshuffled_original_sd - unshuffled_original_sh - unshuffled_original_si - unshuffled_original_sk - unshuffled_original_sl - unshuffled_original_so - unshuffled_original_sq - unshuffled_original_sr - unshuffled_original_su - unshuffled_original_sv - unshuffled_original_sw - unshuffled_original_ta - unshuffled_original_te - unshuffled_original_tg - unshuffled_original_th - unshuffled_original_tk - unshuffled_original_tl - unshuffled_original_tr - unshuffled_original_tt - unshuffled_original_tyv - unshuffled_original_ug - unshuffled_original_uk - unshuffled_original_ur - unshuffled_original_uz - unshuffled_original_vec - 
unshuffled_original_vi - unshuffled_original_vo - unshuffled_original_wa - unshuffled_original_war - unshuffled_original_wuu - unshuffled_original_xal - unshuffled_original_xmf - unshuffled_original_yi - unshuffled_original_yo - unshuffled_original_yue - unshuffled_original_zh
---

# Dataset Card for "oscar"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://oscar-corpus.com](https://oscar-corpus.com)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

OSCAR or **O**pen **S**uper-large **C**rawled [**A**LMAnaCH](https://team.inria.fr/almanach/) co**R**pus
is a huge multilingual corpus obtained by language classification and filtering of the [Common Crawl](https://commoncrawl.org/) corpus using the [goclassy](https://github.com/pjox/goclassy) architecture. Data is distributed by language in both original and deduplicated form.

The version here is the original OSCAR 2019 release: https://oscar-project.org/post/oscar-2019/

For more recent versions, visit the [oscar-corpus](https://huggingface.co/oscar-corpus) organization on the Hub:
- OSCAR 22.01 (released in January 2022): [oscar-corpus/OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201)
- OSCAR 21.09 (released in September 2021): [oscar-corpus/OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)

### Supported Tasks and Leaderboards

OSCAR is mainly intended for pretraining language models and word representations.

### Languages

All the data is distributed by language; both the original and the deduplicated versions of the data are available. 166 different languages are available. The table in subsection [Data Splits Sample Size](#data-splits-sample-size) provides the language code for each subcorpus, as well as the number of words (space-separated tokens), lines, and sizes for both the original and the deduplicated versions of OSCAR.

## Dataset Structure

We show detailed information for all the configurations of the dataset.

### Data Instances

<details>
  <summary>Click to expand the Data/size information for each language (deduplicated)</summary>

#### unshuffled_deduplicated_af

- **Size of downloaded dataset files:** 65.99 MB
- **Size of the generated dataset:** 172.30 MB
- **Total amount of disk used:** 238.29 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
```

#### unshuffled_deduplicated_als

- **Size of downloaded dataset files:** 1.26 MB
- **Size of the generated dataset:** 2.96 MB
- **Total amount of disk used:** 4.22 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}
```

#### unshuffled_deduplicated_am

- **Size of downloaded dataset files:** 61.35 MB
- **Size of the generated dataset:** 216.15 MB
- **Total amount of disk used:** 277.50 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}
```

#### unshuffled_deduplicated_an

- **Size of downloaded dataset files:** 0.14 MB
- **Size of the generated dataset:** 0.85 MB
- **Total amount of disk used:** 0.99 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}
```

#### unshuffled_deduplicated_ar

- **Size of downloaded dataset files:** 9.67 GB
- **Size of the generated dataset:** 33.57 GB
- **Total amount of disk used:** 43.23 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}
```

#### unshuffled_deduplicated_arz

- **Size of downloaded dataset files:** 10.02 MB
- **Size of the generated dataset:** 35.91 MB
- **Total amount of disk used:** 45.94 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}
```

#### unshuffled_deduplicated_as

- **Size of downloaded dataset files:** 15.51 MB
- **Size of the generated dataset:** 74.07 MB
- **Total amount of disk used:** 89.58 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}
```

#### unshuffled_deduplicated_ast

- **Size of downloaded dataset files:** 0.86 MB
- **Size of the generated dataset:** 2.17 MB
- **Total amount of disk used:** 3.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}
```

#### unshuffled_deduplicated_av

- **Size of downloaded dataset files:** 0.07 MB
- **Size of the generated dataset:** 0.34 MB
- **Total amount of disk used:** 0.41 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}
```

#### unshuffled_deduplicated_az

- **Size of downloaded dataset files:** 521.74 MB
- **Size of the generated dataset:** 1.53 GB
- **Total amount of disk used:** 2.05 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}
```

#### unshuffled_deduplicated_azb

- **Size of downloaded dataset files:** 5.19 MB
- **Size of the generated dataset:** 20.08 MB
- **Total amount of disk used:** 25.27 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}
```

#### unshuffled_deduplicated_ba

- **Size of downloaded dataset files:** 25.98 MB
- **Size of the generated dataset:** 93.84 MB
- **Total amount of disk used:** 119.82 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}
```

#### unshuffled_deduplicated_bar

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": " vo"
}
```

#### unshuffled_deduplicated_bcl

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}
```

#### unshuffled_deduplicated_be

- **Size of downloaded dataset files:** 306.70 MB
- **Size of the generated dataset:** 1.08 GB
- **Total amount of disk used:** 1.39 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}
```

#### unshuffled_deduplicated_bg

- **Size of downloaded dataset files:** 3.85 GB
- **Size of the generated dataset:** 14.45 GB
- **Total amount of disk used:** 18.30 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}
```

#### unshuffled_deduplicated_bh

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.04 MB
- **Total amount of disk used:** 0.04 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}
```

#### unshuffled_deduplicated_bn

- **Size of downloaded dataset files:** 1.26 GB
- **Size of the generated dataset:** 6.24 GB
- **Total amount of disk used:** 7.50 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nTagged with ডায়োজিনি..."
}
```

#### unshuffled_deduplicated_bo

- **Size of downloaded dataset files:** 22.37 MB
- **Size of the generated dataset:** 144.65 MB
- **Total amount of disk used:** 167.02 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}
```

#### unshuffled_deduplicated_bpy

- **Size of downloaded dataset files:** 0.19 MB
- **Size of the generated dataset:** 1.78 MB
- **Total amount of disk used:** 1.97 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}
```

#### unshuffled_deduplicated_br

- **Size of downloaded dataset files:** 6.47 MB
- **Size of the generated dataset:** 17.00 MB
- **Total amount of disk used:** 23.47 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}
```

#### unshuffled_deduplicated_bs

- **Size of downloaded dataset files:** 0.04 MB
- **Size of the generated dataset:** 0.15 MB
- **Total amount of disk used:** 0.18 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}
```

#### unshuffled_deduplicated_bxr

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}
```

#### unshuffled_deduplicated_ca

- **Size of downloaded dataset files:** 1.73 GB
- **Size of the generated dataset:** 4.57 GB
- **Total amount of disk used:** 6.30 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}
```

#### unshuffled_deduplicated_cbk

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}
```

#### unshuffled_deduplicated_ce

- **Size of downloaded dataset files:** 1.87 MB
- **Size of the generated dataset:** 7.04 MB
- **Total amount of disk used:** 8.90 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}
```

#### unshuffled_deduplicated_ceb

- **Size of downloaded dataset files:** 7.12 MB
- **Size of the generated dataset:** 24.83 MB
- **Total amount of disk used:** 31.95 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}
```

#### unshuffled_deduplicated_ckb

- **Size of downloaded dataset files:** 60.32 MB
- **Size of the generated dataset:** 237.72 MB
- **Total amount of disk used:** 298.05 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}
```

#### unshuffled_deduplicated_cs

- **Size of downloaded dataset files:** 10.49 GB
- **Size of the generated dataset:** 25.71 GB
- **Total amount of disk used:** 36.20 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}
```

#### unshuffled_deduplicated_cv

- **Size of downloaded dataset files:** 7.47 MB
- **Size of the generated dataset:** 27.49 MB
- **Total amount of disk used:** 34.95 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}
```

#### unshuffled_deduplicated_cy

- **Size of downloaded dataset files:** 53.63 MB
- **Size of the generated dataset:** 141.22 MB
- **Total amount of disk used:** 194.86 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}
```

#### unshuffled_deduplicated_da

- **Size of downloaded dataset files:** 3.82 GB
- **Size of the generated dataset:** 10.24 GB
- **Total amount of disk used:** 14.06 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}
```

#### unshuffled_deduplicated_de

- **Size of downloaded dataset files:** 60.80 GB
- **Size of the generated dataset:** 156.30 GB
- **Total amount of disk used:** 217.10 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}
```

#### unshuffled_deduplicated_diq

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}
```

#### unshuffled_deduplicated_dsb

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}
```

#### unshuffled_deduplicated_dv

- **Size of downloaded dataset files:** 16.84 MB
- **Size of the generated dataset:** 82.19 MB
- **Total amount of disk used:** 99.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}
```

#### unshuffled_deduplicated_el

- **Size of downloaded dataset files:** 7.91 GB
- **Size of the generated dataset:** 28.74 GB
- **Total amount of disk used:** 36.65 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}
```

#### unshuffled_deduplicated_eml

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}
```

#### unshuffled_deduplicated_en

- **Size of downloaded dataset files:** 496.50 GB
- **Size of the generated dataset:** 1299.75 GB
- **Total amount of disk used:** 1796.24 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}
```

#### unshuffled_deduplicated_eo

- **Size of downloaded dataset files:** 92.86 MB
- **Size of the generated dataset:** 240.12 MB
- **Total amount of disk used:** 332.99 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}
```

#### unshuffled_deduplicated_es

- **Size of downloaded dataset files:** 60.46 GB
- **Size of the generated dataset:** 160.86 GB
- **Total amount of disk used:** 221.32 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}
```

#### unshuffled_deduplicated_et

- **Size of downloaded dataset files:** 966.79 MB
- **Size of the generated dataset:** 2.45 GB
- **Total amount of disk used:** 3.41 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}
```

#### unshuffled_deduplicated_eu

- **Size of downloaded dataset files:** 134.68 MB
- **Size of the generated dataset:** 363.93 MB
- **Total amount of disk used:** 498.61 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}
```

#### unshuffled_deduplicated_fa

- **Size of downloaded dataset files:** 10.46 GB
- **Size of the generated dataset:** 40.06 GB
- **Total amount of disk used:** 50.52 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}
```

#### unshuffled_deduplicated_fi

- **Size of downloaded dataset files:** 5.38 GB
- **Size of the generated dataset:** 13.99 GB
- **Total amount of disk used:** 19.37 GB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}
```

#### unshuffled_deduplicated_fr

- **Size of downloaded dataset files:** 55.46 GB
- **Size of the generated dataset:** 148.28 GB
- **Total amount of disk used:** 203.75 GB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}
```

#### unshuffled_deduplicated_frr

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}
```

#### unshuffled_deduplicated_fy

- **Size of downloaded dataset files:** 10.27 MB
- **Size of the generated dataset:** 26.73 MB
- **Total amount of disk used:** 37.00 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}
```

#### unshuffled_deduplicated_ga

- **Size of downloaded dataset files:** 22.22 MB
- **Size of the generated dataset:** 63.86 MB
- **Total amount of disk used:** 86.08 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}
```

#### unshuffled_deduplicated_gd

- **Size of downloaded dataset files:** 0.42 MB
- **Size of the generated dataset:** 1.36 MB
- **Total amount of disk used:** 1.78 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}
```

#### unshuffled_deduplicated_gl

- **Size of downloaded dataset files:** 155.85 MB
- **Size of the generated dataset:** 408.34 MB
- **Total amount of disk used:** 564.19 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}
```

#### unshuffled_deduplicated_gn

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"º ѐÆÚÓ À Ã Ð É Æ ¾ Ä ΠÀ ¼ Æ É ÄÛ = Ü Ý\\\"Þ ß†à á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}
```

#### unshuffled_deduplicated_gom

- **Size of downloaded dataset files:** 0.38 MB
- **Size of the generated dataset:** 1.87 MB
- **Total amount of disk used:** 2.24 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}
```

#### unshuffled_deduplicated_gu

- **Size of downloaded dataset files:** 162.97 MB
- **Size of the generated dataset:** 759.34 MB
- **Total amount of disk used:** 922.32 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}
```

#### unshuffled_deduplicated_he

- **Size of downloaded dataset files:** 3.04 GB
- **Size of the generated dataset:** 10.47 GB
- **Total amount of disk used:** 13.51 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}
```

#### unshuffled_deduplicated_hi

- **Size of downloaded dataset files:** 2.01 GB
- **Size of the generated dataset:** 9.57 GB
- **Total amount of disk used:** 11.58 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्‍सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}
```

#### unshuffled_deduplicated_hr

- **Size of downloaded dataset files:** 46.74 MB
- **Size of the generated dataset:** 121.50 MB
- **Total amount of disk used:** 168.23 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}
```

#### unshuffled_deduplicated_hsb

- **Size of downloaded dataset files:** 0.72 MB
- **Size of the generated dataset:** 1.89 MB
- **Total amount of disk used:** 2.61 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}
```

#### unshuffled_deduplicated_ht

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}
```

#### unshuffled_deduplicated_hu

- **Size of downloaded dataset files:** 7.37 GB
- **Size of the generated dataset:** 19.09 GB
- **Total amount of disk used:** 26.46 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}
```

#### unshuffled_deduplicated_hy

- **Size of downloaded dataset files:** 393.62 MB
- **Size of the generated dataset:** 1.56 GB
- **Total amount of disk used:** 1.96 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}
```

#### unshuffled_deduplicated_ia

- **Size of downloaded dataset files:** 0.05 MB
- **Size of the generated dataset:** 0.38 MB
- **Total amount of disk used:** 0.43 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}
```

#### unshuffled_deduplicated_id

- **Size of downloaded dataset files:** 6.00 GB
- **Size of the generated dataset:** 17.05 GB
- **Total amount of disk used:** 23.05 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}
```

#### unshuffled_deduplicated_ie

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}
```

#### unshuffled_deduplicated_ilo

- **Size of downloaded dataset files:** 0.23 MB
- **Size of the generated dataset:** 0.68 MB
- **Total amount of disk used:** 0.91 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}
```

#### unshuffled_deduplicated_io

- **Size of downloaded dataset files:** 0.04 MB
- **Size of the generated dataset:** 0.14 MB
- **Total amount of disk used:** 0.19 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}
```

#### unshuffled_deduplicated_is

- **Size of downloaded dataset files:** 332.87 MB
- **Size of the generated dataset:** 894.28 MB
- **Total amount of disk used:** 1.23 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}
```

#### unshuffled_deduplicated_it

- **Size of downloaded dataset files:** 27.93 GB
- **Size of the generated dataset:** 74.09 GB
- **Total amount of disk used:** 102.03 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}
```

#### unshuffled_deduplicated_ja

- **Size of downloaded dataset files:** 40.80 GB
- **Size of the generated dataset:** 113.63 GB
- **Total amount of disk used:** 154.44 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}
```

#### unshuffled_deduplicated_jbo

- **Size of downloaded dataset files:** 0.20 MB
- **Size of the generated dataset:** 0.70 MB
- **Total amount of disk used:** 0.91 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}
```

#### unshuffled_deduplicated_jv

- **Size of downloaded dataset files:** 0.21 MB
- **Size of the generated dataset:** 0.62 MB
- **Total amount of disk used:** 0.82 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}
```

#### unshuffled_deduplicated_ka

- **Size of downloaded dataset files:** 377.23 MB
- **Size of the generated dataset:** 1.99 GB
- **Total amount of disk used:** 2.36 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
```

#### unshuffled_deduplicated_kk

- **Size of downloaded dataset files:** 389.12 MB
- **Size of the generated dataset:** 1.59 GB
- **Total amount of disk used:** 1.97 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}
```

#### unshuffled_deduplicated_km

- **Size of downloaded dataset files:** 114.48 MB
- **Size of the generated dataset:** 610.61 MB
- **Total amount of disk used:** 725.09 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}
```

#### unshuffled_deduplicated_kn

- **Size of downloaded dataset files:** 215.52 MB
- **Size of the generated dataset:** 1.08 GB
- **Total amount of disk used:** 1.30 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}
```

#### unshuffled_deduplicated_ko

- **Size of downloaded dataset files:** 4.46 GB
- **Size of the generated dataset:** 12.00 GB
- **Total amount of disk used:** 16.47 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}
```

#### unshuffled_deduplicated_krc

- **Size of downloaded dataset files:** 0.62 MB
- **Size of the generated dataset:** 2.41 MB
- **Total amount of disk used:** 3.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}
```

#### unshuffled_deduplicated_ku

- **Size of downloaded dataset files:** 23.34 MB
- **Size of the generated dataset:** 63.09 MB
- **Total amount of disk used:** 86.43 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}
```

#### unshuffled_deduplicated_kv

- **Size of downloaded dataset files:** 0.33 MB
- **Size of the generated dataset:** 1.21 MB
- **Total amount of disk used:** 1.54 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}
```

#### unshuffled_deduplicated_kw

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.02 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼Pray without ceasing🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏..."
}
```

#### unshuffled_deduplicated_ky

- **Size of downloaded dataset files:** 106.22 MB
- **Size of the generated dataset:** 408.40 MB
- **Total amount of disk used:** 514.61 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}
```

#### unshuffled_deduplicated_la

- **Size of downloaded dataset files:** 3.42 MB
- **Size of the generated dataset:** 9.79 MB
- **Total amount of disk used:** 13.22 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}
```

#### unshuffled_deduplicated_lb

- **Size of downloaded dataset files:** 8.30 MB
- **Size of the generated dataset:** 21.42 MB
- **Total amount of disk used:** 29.72 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}
```

#### unshuffled_deduplicated_lez

- **Size of downloaded dataset files:** 0.77 MB
- **Size of the generated dataset:** 3.08 MB
- **Total amount of disk used:** 3.84 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}
```

#### unshuffled_deduplicated_li

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.03 MB
- **Total amount of disk used:** 0.04 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}
```

#### unshuffled_deduplicated_lmo

- **Size of downloaded dataset files:** 0.10 MB
- **Size of the generated dataset:** 0.46 MB
- **Total amount of disk used:** 0.57 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}
```

#### unshuffled_deduplicated_lo

- **Size of downloaded dataset files:** 23.63 MB
- **Size of the generated dataset:** 119.29 MB
- **Total amount of disk used:** 142.92 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}
```

#### unshuffled_deduplicated_lrc

- **Size of downloaded dataset files:** 0.02 MB
- **Size of the generated dataset:** 0.06 MB
- **Total amount of disk used:** 0.08 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}
```

#### unshuffled_deduplicated_lt

- **Size of downloaded dataset files:** 1.65 GB
- **Size of the generated dataset:** 4.20 GB
- **Total amount of disk used:** 5.86 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}
```

#### unshuffled_deduplicated_lv

- **Size of downloaded dataset files:** 710.45 MB
- **Size of the generated dataset:** 1.91 GB
- **Total amount of disk used:** 2.62 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}
```

#### unshuffled_deduplicated_mai

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}
```

#### unshuffled_deduplicated_mg

- **Size of downloaded dataset files:** 4.30 MB
- **Size of the generated dataset:** 13.59 MB
- **Total amount of disk used:** 17.89 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}
```

#### unshuffled_deduplicated_mhr

- **Size of downloaded dataset files:** 1.63 MB
- **Size of the generated dataset:** 6.26 MB
- **Total amount of disk used:** 7.89 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}
```

#### unshuffled_deduplicated_min

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.31 MB
- **Total amount of disk used:** 0.33 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ..."
}
```

#### unshuffled_deduplicated_mk

- **Size of downloaded dataset files:** 303.12 MB
- **Size of the generated dataset:** 1.19 GB
- **Total amount of disk used:** 1.49 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}
```

#### unshuffled_deduplicated_ml

- **Size of downloaded dataset files:** 496.80 MB
- **Size of the generated dataset:** 2.69 GB
- **Total amount of disk used:** 3.18 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"സ്ത്രീ പ്രവേശനം സര്‍ക്കാര്‍ പൂര്‍ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില്‍ ഇടപെടുമെന്നും സര്‍ക്കാര്‍ ഹൈക്കോടതിയില്‍\\..."
}
```

#### unshuffled_deduplicated_mn

- **Size of downloaded dataset files:** 219.52 MB
- **Size of the generated dataset:** 883.46 MB
- **Total amount of disk used:** 1.10 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"МУБИС-ын багш мэргэжлийн хөрвөх сургалтыг төгссөн багшид багшлах эрх олгох тухай ~ БМДИ-ийн захирлын тушаал - Багшийн мэргэжил ..."
}
```

#### unshuffled_deduplicated_mr

- **Size of downloaded dataset files:** 299.68 MB
- **Size of the generated dataset:** 1.49 GB
- **Total amount of disk used:** 1.79 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}
```

#### unshuffled_deduplicated_mrj

- **Size of downloaded dataset files:** 0.29 MB
- **Size of the generated dataset:** 1.10 MB
- **Total amount of disk used:** 1.38 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}
```

#### unshuffled_deduplicated_ms

- **Size of downloaded dataset files:** 16.39 MB
- **Size of the generated dataset:** 49.45 MB
- **Total amount of disk used:** 65.85 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}
```

#### unshuffled_deduplicated_mt

- **Size of downloaded dataset files:** 5.90 MB
- **Size of the generated dataset:** 17.68 MB
- **Total amount of disk used:** 23.58 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}
```

#### unshuffled_deduplicated_mwl

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}
```

#### unshuffled_deduplicated_my

- **Size of downloaded dataset files:** 207.14 MB
- **Size of the generated dataset:** 1.11 GB
- **Total amount of disk used:** 1.32 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}
```

#### unshuffled_deduplicated_myv

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}
```

#### unshuffled_deduplicated_mzn

- **Size of downloaded dataset files:** 0.16 MB
- **Size of the generated dataset:** 0.63 MB
- **Total amount of disk used:** 0.79 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنی‌یه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}
```

#### unshuffled_deduplicated_nah

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}
```

#### unshuffled_deduplicated_nap

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.02 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}
```

#### unshuffled_deduplicated_nds

- **Size of downloaded dataset files:** 5.27 MB
- **Size of the generated dataset:** 13.48 MB
- **Total amount of disk used:** 18.76 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}
```

#### unshuffled_deduplicated_ne

- **Size of downloaded dataset files:** 240.63 MB
- **Size of the generated dataset:** 1.24 GB
- **Total amount of disk used:** 1.48 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}
```

#### unshuffled_deduplicated_new

- **Size of downloaded dataset files:** 0.83 MB
- **Size of the generated dataset:** 4.26 MB
- **Total amount of disk used:** 5.09 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}
```

#### unshuffled_deduplicated_nl

- **Size of downloaded dataset files:** 15.73 GB
- **Size of the generated dataset:** 41.91 GB
- **Total amount of disk used:** 57.65 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}
```

#### unshuffled_deduplicated_nn

- **Size of downloaded dataset files:** 23.58 MB
- **Size of the generated dataset:** 58.32 MB
- **Total amount of disk used:** 81.90 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}
```

#### unshuffled_deduplicated_no

- **Size of downloaded dataset files:** 1.96 GB
- **Size of the generated dataset:** 5.11 GB
- **Total amount of disk used:** 7.07 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}
```

#### unshuffled_deduplicated_oc

- **Size of downloaded dataset files:** 1.34 MB
- **Size of the generated dataset:** 4.00 MB
- **Total amount of disk used:** 5.34 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}
```

#### unshuffled_deduplicated_or

- **Size of downloaded dataset files:** 38.72 MB
- **Size of the generated dataset:** 197.63 MB
- **Total amount of disk used:** 236.36 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}
```

#### unshuffled_deduplicated_os

- **Size of downloaded dataset files:** 2.83 MB
- **Size of the generated dataset:** 11.00 MB
- **Total amount of disk used:** 13.83 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}
```

#### unshuffled_deduplicated_pa

- **Size of downloaded dataset files:** 102.39 MB
- **Size of the generated dataset:** 483.04 MB
- **Total amount of disk used:** 585.42 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}
```

#### unshuffled_deduplicated_pam

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}
```

#### unshuffled_deduplicated_pl

- **Size of downloaded dataset files:** 20.19 GB
- **Size of the generated dataset:** 50.59 GB
- **Total amount of disk used:** 70.78 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}
```

#### unshuffled_deduplicated_pms

- **Size of downloaded dataset files:** 0.71 MB
- **Size of the generated dataset:** 2.00 MB
- **Total amount of disk used:** 2.72 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}
```

#### unshuffled_deduplicated_pnb

- **Size of downloaded dataset files:** 2.58 MB
- **Size of the generated dataset:** 9.44 MB
- **Total amount of disk used:** 12.02 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}
```

#### unshuffled_deduplicated_ps

- **Size of downloaded dataset files:** 71.83 MB
- **Size of the generated dataset:** 254.79 MB
- **Total amount of disk used:** 326.61 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}
```

#### unshuffled_deduplicated_pt

- **Size of downloaded dataset files:** 26.00 GB
- **Size of the generated dataset:** 68.37 GB
- **Total amount of disk used:** 94.37 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}
```

#### unshuffled_deduplicated_qu

- **Size of downloaded dataset files:** 0.02 MB
- **Size of the generated dataset:** 0.07 MB
- **Total amount of disk used:** 0.09 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}
```

#### unshuffled_deduplicated_rm

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}
```

#### unshuffled_deduplicated_ro

- **Size of downloaded dataset files:** 4.48 GB
- **Size of the generated dataset:** 11.66 GB
- **Total amount of disk used:** 16.14 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}
```

#### unshuffled_deduplicated_ru

- **Size of downloaded dataset files:** 166.68 GB
- **Size of the generated dataset:** 611.70 GB
- **Total amount of disk used:** 778.38 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}
```

#### unshuffled_deduplicated_sa

- **Size of downloaded dataset files:** 7.27 MB
- **Size of the generated dataset:** 38.33 MB
- **Total amount of disk used:** 45.60 MB

An example of 'train' looks as follows.
``` This example was too long and was cropped: { "id": 0, "text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्‍प्रति समाप्‍ता अस्ति । तस्‍य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..." } ``` #### unshuffled_deduplicated_sah - **Size of downloaded dataset files:** 7.01 MB - **Size of the generated dataset:** 27.46 MB - **Total amount of disk used:** 34.49 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..." } ``` #### unshuffled_deduplicated_scn - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.00 MB - **Total amount of disk used:** 0.00 MB An example of 'train' looks as follows. ``` { "id": 0, "text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati." } ``` #### unshuffled_deduplicated_sd - **Size of downloaded dataset files:** 74.17 MB - **Size of the generated dataset:** 275.48 MB - **Total amount of disk used:** 349.66 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..." } ``` #### unshuffled_deduplicated_sh - **Size of downloaded dataset files:** 1.45 MB - **Size of the generated dataset:** 6.44 MB - **Total amount of disk used:** 7.87 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..." 
} ``` #### unshuffled_deduplicated_si - **Size of downloaded dataset files:** 175.62 MB - **Size of the generated dataset:** 842.57 MB - **Total amount of disk used:** 1.02 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..." } ``` #### unshuffled_deduplicated_sk - **Size of downloaded dataset files:** 1.96 GB - **Size of the generated dataset:** 4.80 GB - **Total amount of disk used:** 6.76 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..." } ``` #### unshuffled_deduplicated_sl - **Size of downloaded dataset files:** 523.22 MB - **Size of the generated dataset:** 1.32 GB - **Total amount of disk used:** 1.85 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..." } ``` #### unshuffled_deduplicated_so - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.02 MB - **Total amount of disk used:** 0.02 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..." } ``` #### unshuffled_deduplicated_sq - **Size of downloaded dataset files:** 445.36 MB - **Size of the generated dataset:** 1.21 GB - **Total amount of disk used:** 1.66 GB An example of 'train' looks as follows. 
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}
```

#### unshuffled_deduplicated_sr

- **Size of downloaded dataset files:** 665.03 MB
- **Size of the generated dataset:** 2.36 GB
- **Total amount of disk used:** 3.03 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}
```

#### unshuffled_deduplicated_su

- **Size of downloaded dataset files:** 0.05 MB
- **Size of the generated dataset:** 0.16 MB
- **Total amount of disk used:** 0.21 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}
```

#### unshuffled_deduplicated_sv

- **Size of downloaded dataset files:** 10.19 GB
- **Size of the generated dataset:** 26.33 GB
- **Total amount of disk used:** 36.51 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}
```

#### unshuffled_deduplicated_sw

- **Size of downloaded dataset files:** 2.95 MB
- **Size of the generated dataset:** 8.98 MB
- **Total amount of disk used:** 11.92 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}
```

#### unshuffled_deduplicated_ta

- **Size of downloaded dataset files:** 971.12 MB
- **Size of the generated dataset:** 5.48 GB
- **Total amount of disk used:** 6.45 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}
```

#### unshuffled_deduplicated_te

- **Size of downloaded dataset files:** 342.43 MB
- **Size of the generated dataset:** 1.70 GB
- **Total amount of disk used:** 2.04 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}
```

#### unshuffled_deduplicated_tg

- **Size of downloaded dataset files:** 62.90 MB
- **Size of the generated dataset:** 261.68 MB
- **Total amount of disk used:** 324.60 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}
```

#### unshuffled_deduplicated_th

- **Size of downloaded dataset files:** 3.54 GB
- **Size of the generated dataset:** 17.11 GB
- **Total amount of disk used:** 20.65 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}
```

#### unshuffled_deduplicated_tk

- **Size of downloaded dataset files:** 2.22 MB
- **Size of the generated dataset:** 7.12 MB
- **Total amount of disk used:** 9.34 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}
```

#### unshuffled_deduplicated_tl

- **Size of downloaded dataset files:** 151.34 MB
- **Size of the generated dataset:** 431.69 MB
- **Total amount of disk used:** 583.04 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}
```

#### unshuffled_deduplicated_tr

- **Size of downloaded dataset files:** 10.39 GB
- **Size of the generated dataset:** 28.47 GB
- **Total amount of disk used:** 38.86 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}
```

#### unshuffled_deduplicated_tt

- **Size of downloaded dataset files:** 85.89 MB
- **Size of the generated dataset:** 321.37 MB
- **Total amount of disk used:** 407.26 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}
```

#### unshuffled_deduplicated_tyv

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}
```

#### unshuffled_deduplicated_ug

- **Size of downloaded dataset files:** 20.53 MB
- **Size of the generated dataset:** 86.44 MB
- **Total amount of disk used:** 106.97 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}
```

#### unshuffled_deduplicated_uk

- **Size of downloaded dataset files:** 8.04 GB
- **Size of the generated dataset:** 29.86 GB
- **Total amount of disk used:** 37.90 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}
```

#### unshuffled_deduplicated_ur

- **Size of downloaded dataset files:** 483.59 MB
- **Size of the generated dataset:** 1.82 GB
- **Total amount of disk used:** 2.31 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}
```

#### unshuffled_deduplicated_uz

- **Size of downloaded dataset files:** 4.30 MB
- **Size of the generated dataset:** 12.00 MB
- **Total amount of disk used:** 16.29 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}
```

#### unshuffled_deduplicated_vec

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.02 MB

An example of 'train' looks as follows.
``` This example was too long and was cropped: { "id": 0, "text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..." } ``` #### unshuffled_deduplicated_vi - **Size of downloaded dataset files:** 10.71 GB - **Size of the generated dataset:** 33.60 GB - **Total amount of disk used:** 44.31 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..." } ``` #### unshuffled_deduplicated_vo - **Size of downloaded dataset files:** 0.30 MB - **Size of the generated dataset:** 2.10 MB - **Total amount of disk used:** 2.40 MB An example of 'train' looks as follows. ``` { "id": 1, "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L." } ``` #### unshuffled_deduplicated_wa - **Size of downloaded dataset files:** 0.08 MB - **Size of the generated dataset:** 0.22 MB - **Total amount of disk used:** 0.29 MB An example of 'train' looks as follows. ``` { "id": 1, "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete." } ``` #### unshuffled_deduplicated_war - **Size of downloaded dataset files:** 0.55 MB - **Size of the generated dataset:** 2.36 MB - **Total amount of disk used:** 2.90 MB An example of 'train' looks as follows. ``` { "id": 1, "text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya." 
} ``` #### unshuffled_deduplicated_wuu - **Size of downloaded dataset files:** 0.01 MB - **Size of the generated dataset:** 0.03 MB - **Total amount of disk used:** 0.04 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..." } ``` #### unshuffled_deduplicated_xal - **Size of downloaded dataset files:** 0.03 MB - **Size of the generated dataset:** 0.12 MB - **Total amount of disk used:** 0.15 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..." } ``` #### unshuffled_deduplicated_xmf - **Size of downloaded dataset files:** 0.94 MB - **Size of the generated dataset:** 4.63 MB - **Total amount of disk used:** 5.58 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..." } ``` #### unshuffled_deduplicated_yi - **Size of downloaded dataset files:** 22.20 MB - **Size of the generated dataset:** 88.29 MB - **Total amount of disk used:** 110.49 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..." } ``` #### unshuffled_deduplicated_yo - **Size of downloaded dataset files:** 0.01 MB - **Size of the generated dataset:** 0.03 MB - **Total amount of disk used:** 0.04 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Copyright © 2018 BBC. 
BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..." } ``` #### unshuffled_deduplicated_yue - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.00 MB - **Total amount of disk used:** 0.00 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..." } ``` #### unshuffled_deduplicated_zh - **Size of downloaded dataset files:** 99.98 GB - **Size of the generated dataset:** 267.88 GB - **Total amount of disk used:** 367.86 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..." } ``` </details> <details> <summary>Click to expand the Data/size information for each language (original)</summary> #### unshuffled_original_af - **Size of downloaded dataset files:** 85.79 MB - **Size of the generated dataset:** 254.08 MB - **Total amount of disk used:** 339.87 MB An example of 'train' looks as follows. ``` { "id": 0, "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel" } ``` #### unshuffled_original_als - **Size of downloaded dataset files:** 1.49 MB - **Size of the generated dataset:** 5.30 MB - **Total amount of disk used:** 6.78 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..." 
} ``` #### unshuffled_original_am - **Size of downloaded dataset files:** 102.79 MB - **Size of the generated dataset:** 378.06 MB - **Total amount of disk used:** 480.85 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..." } ``` #### unshuffled_original_an - **Size of downloaded dataset files:** 0.15 MB - **Size of the generated dataset:** 1.33 MB - **Total amount of disk used:** 1.48 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..." } ``` #### unshuffled_original_ar - **Size of downloaded dataset files:** 22.23 GB - **Size of the generated dataset:** 87.94 GB - **Total amount of disk used:** 110.17 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..." } ``` #### unshuffled_original_arz - **Size of downloaded dataset files:** 15.90 MB - **Size of the generated dataset:** 70.13 MB - **Total amount of disk used:** 86.03 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..." } ``` #### unshuffled_original_as - **Size of downloaded dataset files:** 21.43 MB - **Size of the generated dataset:** 117.73 MB - **Total amount of disk used:** 139.17 MB An example of 'train' looks as follows. 
``` This example was too long and was cropped: { "id": 0, "text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..." } ``` #### unshuffled_original_ast - **Size of downloaded dataset files:** 0.92 MB - **Size of the generated dataset:** 2.54 MB - **Total amount of disk used:** 3.46 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..." } ``` #### unshuffled_original_av - **Size of downloaded dataset files:** 0.08 MB - **Size of the generated dataset:** 0.42 MB - **Total amount of disk used:** 0.50 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..." } ``` #### unshuffled_original_az - **Size of downloaded dataset files:** 927.76 MB - **Size of the generated dataset:** 2.96 GB - **Total amount of disk used:** 3.89 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..." } ``` #### unshuffled_original_azb - **Size of downloaded dataset files:** 6.64 MB - **Size of the generated dataset:** 28.47 MB - **Total amount of disk used:** 35.11 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..." 
} ``` #### unshuffled_original_ba - **Size of downloaded dataset files:** 33.22 MB - **Size of the generated dataset:** 133.70 MB - **Total amount of disk used:** 166.92 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..." } ``` #### unshuffled_original_bar - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.00 MB - **Total amount of disk used:** 0.00 MB An example of 'train' looks as follows. ``` { "id": 0, "text": " vo" } ``` #### unshuffled_original_bcl - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.00 MB - **Total amount of disk used:** 0.00 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..." } ``` #### unshuffled_original_be - **Size of downloaded dataset files:** 498.29 MB - **Size of the generated dataset:** 1.88 GB - **Total amount of disk used:** 2.38 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..." } ``` #### unshuffled_original_bg - **Size of downloaded dataset files:** 8.34 GB - **Size of the generated dataset:** 33.75 GB - **Total amount of disk used:** 42.09 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..." 
} ``` #### unshuffled_original_bh - **Size of downloaded dataset files:** 0.01 MB - **Size of the generated dataset:** 0.12 MB - **Total amount of disk used:** 0.13 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..." } ``` #### unshuffled_original_bn - **Size of downloaded dataset files:** 2.14 GB - **Size of the generated dataset:** 10.77 GB - **Total amount of disk used:** 12.91 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nভড়ং সর্বস্ব বাংলা আর..." } ``` #### unshuffled_original_bo - **Size of downloaded dataset files:** 28.94 MB - **Size of the generated dataset:** 195.40 MB - **Total amount of disk used:** 224.34 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..." } ``` #### unshuffled_original_bpy - **Size of downloaded dataset files:** 0.34 MB - **Size of the generated dataset:** 4.35 MB - **Total amount of disk used:** 4.69 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..." } ``` #### unshuffled_original_br - **Size of downloaded dataset files:** 9.18 MB - **Size of the generated dataset:** 30.20 MB - **Total amount of disk used:** 39.38 MB An example of 'train' looks as follows. 
``` This example was too long and was cropped: { "id": 0, "text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..." } ``` #### unshuffled_original_bs - **Size of downloaded dataset files:** 0.05 MB - **Size of the generated dataset:** 0.48 MB - **Total amount of disk used:** 0.53 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..." } ``` #### unshuffled_original_bxr - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.01 MB - **Total amount of disk used:** 0.02 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..." } ``` #### unshuffled_original_ca - **Size of downloaded dataset files:** 3.10 GB - **Size of the generated dataset:** 8.62 GB - **Total amount of disk used:** 11.73 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..." } ``` #### unshuffled_original_cbk - **Size of downloaded dataset files:** 0.00 MB - **Size of the generated dataset:** 0.00 MB - **Total amount of disk used:** 0.00 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..." 
} ``` #### unshuffled_original_ce - **Size of downloaded dataset files:** 2.09 MB - **Size of the generated dataset:** 8.73 MB - **Total amount of disk used:** 10.82 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..." } ``` #### unshuffled_original_ceb - **Size of downloaded dataset files:** 11.07 MB - **Size of the generated dataset:** 40.97 MB - **Total amount of disk used:** 52.03 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..." } ``` #### unshuffled_original_ckb - **Size of downloaded dataset files:** 111.88 MB - **Size of the generated dataset:** 510.97 MB - **Total amount of disk used:** 622.85 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..." } ``` #### unshuffled_original_cs - **Size of downloaded dataset files:** 21.72 GB - **Size of the generated dataset:** 57.08 GB - **Total amount of disk used:** 78.80 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..." } ``` #### unshuffled_original_cv - **Size of downloaded dataset files:** 9.40 MB - **Size of the generated dataset:** 41.05 MB - **Total amount of disk used:** 50.45 MB An example of 'train' looks as follows. 
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}
```

#### unshuffled_original_cy

- **Size of downloaded dataset files:** 81.74 MB
- **Size of the generated dataset:** 224.93 MB
- **Total amount of disk used:** 306.67 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}
```

#### unshuffled_original_da

- **Size of downloaded dataset files:** 6.00 GB
- **Size of the generated dataset:** 16.76 GB
- **Total amount of disk used:** 22.76 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}
```

#### unshuffled_original_de

- **Size of downloaded dataset files:** 119.51 GB
- **Size of the generated dataset:** 331.22 GB
- **Total amount of disk used:** 450.73 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}
```

#### unshuffled_original_diq

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}
```

#### unshuffled_original_dsb

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.02 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}
```

#### unshuffled_original_dv

- **Size of downloaded dataset files:** 24.91 MB
- **Size of the generated dataset:** 131.63 MB
- **Total amount of disk used:** 156.54 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}
```

#### unshuffled_original_el

- **Size of downloaded dataset files:** 17.31 GB
- **Size of the generated dataset:** 66.27 GB
- **Total amount of disk used:** 83.58 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}
```

#### unshuffled_original_eml

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}
```

#### unshuffled_original_en

- **Size of downloaded dataset files:** 903.83 GB
- **Size of the generated dataset:** 2525.44 GB
- **Total amount of disk used:** 3429.27 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}
```

#### unshuffled_original_eo

- **Size of downloaded dataset files:** 117.07 MB
- **Size of the generated dataset:** 314.18 MB
- **Total amount of disk used:** 431.27 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}
```

#### unshuffled_original_es

- **Size of downloaded dataset files:** 106.04 GB
- **Size of the generated dataset:** 298.49 GB
- **Total amount of disk used:** 404.53 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}
```

#### unshuffled_original_et

- **Size of downloaded dataset files:** 1.88 GB
- **Size of the generated dataset:** 5.17 GB
- **Total amount of disk used:** 7.06 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}
```

#### unshuffled_original_eu

- **Size of downloaded dataset files:** 248.19 MB
- **Size of the generated dataset:** 894.83 MB
- **Total amount of disk used:** 1.14 GB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}
```

#### unshuffled_original_fa

- **Size of downloaded dataset files:** 20.96 GB
- **Size of the generated dataset:** 84.21 GB
- **Total amount of disk used:** 105.17 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}
```

#### unshuffled_original_fi

- **Size of downloaded dataset files:** 9.97 GB
- **Size of the generated dataset:** 28.57 GB
- **Total amount of disk used:** 38.54 GB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}
```

#### unshuffled_original_fr

- **Size of downloaded dataset files:** 105.32 GB
- **Size of the generated dataset:** 303.19 GB
- **Total amount of disk used:** 408.51 GB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}
```

#### unshuffled_original_frr

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}
```

#### unshuffled_original_fy

- **Size of downloaded dataset files:** 12.40 MB
- **Size of the generated dataset:** 36.24 MB
- **Total amount of disk used:** 48.64 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}
```

#### unshuffled_original_ga

- **Size of downloaded dataset files:** 29.27 MB
- **Size of the generated dataset:** 92.37 MB
- **Total amount of disk used:** 121.63 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}
```

#### unshuffled_original_gd

- **Size of downloaded dataset files:** 0.52 MB
- **Size of the generated dataset:** 2.02 MB
- **Total amount of disk used:** 2.55 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}
```

#### unshuffled_original_gl

- **Size of downloaded dataset files:** 235.38 MB
- **Size of the generated dataset:** 656.48 MB
- **Total amount of disk used:** 891.87 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}
```

#### unshuffled_original_gn

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.04 MB
- **Total amount of disk used:** 0.05 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"º ѐÆÚÓ À Ã Ð É Æ ¾ Ä ΠÀ ¼ Æ É ÄÛ = Ü Ý\\\"Þ ß†à á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}
```

#### unshuffled_original_gom

- **Size of downloaded dataset files:** 0.44 MB
- **Size of the generated dataset:** 2.25 MB
- **Total amount of disk used:** 2.71 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}
```

#### unshuffled_original_gu

- **Size of downloaded dataset files:** 232.02 MB
- **Size of the generated dataset:** 1.09 GB
- **Total amount of disk used:** 1.33 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}
```

#### unshuffled_original_he

- **Size of downloaded dataset files:** 5.66 GB
- **Size of the generated dataset:** 21.11 GB
- **Total amount of disk used:** 26.77 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}
```

#### unshuffled_original_hi

- **Size of downloaded dataset files:** 3.66 GB
- **Size of the generated dataset:** 17.93 GB
- **Total amount of disk used:** 21.59 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्‍सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}
```

#### unshuffled_original_hr

- **Size of downloaded dataset files:** 79.42 MB
- **Size of the generated dataset:** 243.83 MB
- **Total amount of disk used:** 323.24 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}
```

#### unshuffled_original_hsb

- **Size of downloaded dataset files:** 1.39 MB
- **Size of the generated dataset:** 4.49 MB
- **Total amount of disk used:** 5.87 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}
```

#### unshuffled_original_ht

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}
```

#### unshuffled_original_hu

- **Size of downloaded dataset files:** 15.69 GB
- **Size of the generated dataset:** 43.07 GB
- **Total amount of disk used:** 58.77 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}
```

#### unshuffled_original_hy

- **Size of downloaded dataset files:** 897.36 MB
- **Size of the generated dataset:** 3.94 GB
- **Total amount of disk used:** 4.84 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}
```

#### unshuffled_original_ia

- **Size of downloaded dataset files:** 0.08 MB
- **Size of the generated dataset:** 0.69 MB
- **Total amount of disk used:** 0.78 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}
```

#### unshuffled_original_id

- **Size of downloaded dataset files:** 10.60 GB
- **Size of the generated dataset:** 32.32 GB
- **Total amount of disk used:** 42.91 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}
```

#### unshuffled_original_ie

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.02 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}
```

#### unshuffled_original_ilo

- **Size of downloaded dataset files:** 0.27 MB
- **Size of the generated dataset:** 0.92 MB
- **Total amount of disk used:** 1.20 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}
```

#### unshuffled_original_io

- **Size of downloaded dataset files:** 0.04 MB
- **Size of the generated dataset:** 0.16 MB
- **Total amount of disk used:** 0.20 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}
```

#### unshuffled_original_is

- **Size of downloaded dataset files:** 533.03 MB
- **Size of the generated dataset:** 1.52 GB
- **Total amount of disk used:** 2.06 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}
```

#### unshuffled_original_it

- **Size of downloaded dataset files:** 52.16 GB
- **Size of the generated dataset:** 147.38 GB
- **Total amount of disk used:** 199.54 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}
```

#### unshuffled_original_ja

- **Size of downloaded dataset files:** 79.56 GB
- **Size of the generated dataset:** 232.22 GB
- **Total amount of disk used:** 311.78 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}
```

#### unshuffled_original_jbo

- **Size of downloaded dataset files:** 0.21 MB
- **Size of the generated dataset:** 0.77 MB
- **Total amount of disk used:** 0.98 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}
```

#### unshuffled_original_jv

- **Size of downloaded dataset files:** 0.22 MB
- **Size of the generated dataset:** 0.69 MB
- **Total amount of disk used:** 0.91 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}
```

#### unshuffled_original_ka

- **Size of downloaded dataset files:** 680.74 MB
- **Size of the generated dataset:** 3.77 GB
- **Total amount of disk used:** 4.45 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
```

#### unshuffled_original_kk

- **Size of downloaded dataset files:** 615.06 MB
- **Size of the generated dataset:** 2.83 GB
- **Total amount of disk used:** 3.45 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}
```

#### unshuffled_original_km

- **Size of downloaded dataset files:** 193.28 MB
- **Size of the generated dataset:** 1.10 GB
- **Total amount of disk used:** 1.30 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}
```

#### unshuffled_original_kn

- **Size of downloaded dataset files:** 342.15 MB
- **Size of the generated dataset:** 1.76 GB
- **Total amount of disk used:** 2.11 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}
```

#### unshuffled_original_ko

- **Size of downloaded dataset files:** 8.81 GB
- **Size of the generated dataset:** 25.29 GB
- **Total amount of disk used:** 34.10 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}
```

#### unshuffled_original_krc

- **Size of downloaded dataset files:** 0.66 MB
- **Size of the generated dataset:** 2.68 MB
- **Total amount of disk used:** 3.34 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}
```

#### unshuffled_original_ku

- **Size of downloaded dataset files:** 33.38 MB
- **Size of the generated dataset:** 99.06 MB
- **Total amount of disk used:** 132.44 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}
```

#### unshuffled_original_kv

- **Size of downloaded dataset files:** 0.40 MB
- **Size of the generated dataset:** 2.38 MB
- **Total amount of disk used:** 2.78 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}
```

#### unshuffled_original_kw

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.04 MB
- **Total amount of disk used:** 0.05 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼Pray without ceasing🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏..."
}
```

#### unshuffled_original_ky

- **Size of downloaded dataset files:** 152.64 MB
- **Size of the generated dataset:** 630.79 MB
- **Total amount of disk used:** 783.43 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}
```

#### unshuffled_original_la

- **Size of downloaded dataset files:** 5.46 MB
- **Size of the generated dataset:** 27.80 MB
- **Total amount of disk used:** 33.26 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}
```

#### unshuffled_original_lb

- **Size of downloaded dataset files:** 10.73 MB
- **Size of the generated dataset:** 30.60 MB
- **Total amount of disk used:** 41.32 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}
```

#### unshuffled_original_lez

- **Size of downloaded dataset files:** 0.83 MB
- **Size of the generated dataset:** 3.38 MB
- **Total amount of disk used:** 4.20 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}
```

#### unshuffled_original_li

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.03 MB
- **Total amount of disk used:** 0.04 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}
```

#### unshuffled_original_lmo

- **Size of downloaded dataset files:** 0.10 MB
- **Size of the generated dataset:** 0.47 MB
- **Total amount of disk used:** 0.58 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}
```

#### unshuffled_original_lo

- **Size of downloaded dataset files:** 33.92 MB
- **Size of the generated dataset:** 182.36 MB
- **Total amount of disk used:** 216.28 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}
```

#### unshuffled_original_lrc

- **Size of downloaded dataset files:** 0.02 MB
- **Size of the generated dataset:** 0.07 MB
- **Total amount of disk used:** 0.09 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}
```

#### unshuffled_original_lt

- **Size of downloaded dataset files:** 3.44 GB
- **Size of the generated dataset:** 9.45 GB
- **Total amount of disk used:** 12.89 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}
```

#### unshuffled_original_lv

- **Size of downloaded dataset files:** 1.49 GB
- **Size of the generated dataset:** 4.27 GB
- **Total amount of disk used:** 5.75 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}
```

#### unshuffled_original_mai

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.33 MB
- **Total amount of disk used:** 0.34 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}
```

#### unshuffled_original_mg

- **Size of downloaded dataset files:** 6.22 MB
- **Size of the generated dataset:** 21.79 MB
- **Total amount of disk used:** 28.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}
```

#### unshuffled_original_mhr

- **Size of downloaded dataset files:** 1.84 MB
- **Size of the generated dataset:** 7.55 MB
- **Total amount of disk used:** 9.38 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
} ``` #### unshuffled_original_min - **Size of downloaded dataset files:** 0.01 MB - **Size of the generated dataset:** 0.63 MB - **Total amount of disk used:** 0.64 MB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ..." } ``` #### unshuffled_original_mk - **Size of downloaded dataset files:** 508.24 MB - **Size of the generated dataset:** 2.20 GB - **Total amount of disk used:** 2.71 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 0, "text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..." } ``` #### unshuffled_original_ml - **Size of downloaded dataset files:** 938.69 MB - **Size of the generated dataset:** 5.24 GB - **Total amount of disk used:** 6.18 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"സ്ത്രീ പ്രവേശനം സര്‍ക്കാര്‍ പൂര്‍ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില്‍ ഇടപെടുമെന്നും സര്‍ക്കാര്‍ ഹൈക്കോടതിയില്‍\\..." } ``` #### unshuffled_original_mn - **Size of downloaded dataset files:** 472.36 MB - **Size of the generated dataset:** 2.33 GB - **Total amount of disk used:** 2.81 GB An example of 'train' looks as follows. ``` This example was too long and was cropped: { "id": 1, "text": "\"Монгол улс, Улаанбаатар хот - 14191 Энхтайваны өргөн чөлөө - 10, Багш хөгжлийн ордон, Багшийн мэргэжил дээшлүүлэх институт\\nБаг..." } ``` #### unshuffled_original_mr - **Size of downloaded dataset files:** 525.31 MB - **Size of the generated dataset:** 2.82 GB - **Total amount of disk used:** 3.34 GB An example of 'train' looks as follows. 
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}
```

#### unshuffled_original_mrj

- **Size of downloaded dataset files:** 0.30 MB
- **Size of the generated dataset:** 1.16 MB
- **Total amount of disk used:** 1.47 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}
```

#### unshuffled_original_ms

- **Size of downloaded dataset files:** 28.46 MB
- **Size of the generated dataset:** 122.33 MB
- **Total amount of disk used:** 150.79 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}
```

#### unshuffled_original_mt

- **Size of downloaded dataset files:** 7.53 MB
- **Size of the generated dataset:** 24.47 MB
- **Total amount of disk used:** 32.00 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}
```

#### unshuffled_original_mwl

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}
```

#### unshuffled_original_my

- **Size of downloaded dataset files:** 369.85 MB
- **Size of the generated dataset:** 2.02 GB
- **Total amount of disk used:** 2.39 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}
```

#### unshuffled_original_myv

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}
```

#### unshuffled_original_mzn

- **Size of downloaded dataset files:** 0.18 MB
- **Size of the generated dataset:** 0.72 MB
- **Total amount of disk used:** 0.90 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنی‌یه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}
```

#### unshuffled_original_nah

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}
```

#### unshuffled_original_nap

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.02 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}
```

#### unshuffled_original_nds

- **Size of downloaded dataset files:** 6.74 MB
- **Size of the generated dataset:** 18.23 MB
- **Total amount of disk used:** 24.99 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}
```

#### unshuffled_original_ne

- **Size of downloaded dataset files:** 355.29 MB
- **Size of the generated dataset:** 1.87 GB
- **Total amount of disk used:** 2.22 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}
```

#### unshuffled_original_new

- **Size of downloaded dataset files:** 1.03 MB
- **Size of the generated dataset:** 5.77 MB
- **Total amount of disk used:** 6.79 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}
```

#### unshuffled_original_nl

- **Size of downloaded dataset files:** 29.35 GB
- **Size of the generated dataset:** 83.23 GB
- **Total amount of disk used:** 112.58 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}
```

#### unshuffled_original_nn

- **Size of downloaded dataset files:** 32.86 MB
- **Size of the generated dataset:** 90.84 MB
- **Total amount of disk used:** 123.70 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}
```

#### unshuffled_original_no

- **Size of downloaded dataset files:** 3.11 GB
- **Size of the generated dataset:** 8.65 GB
- **Total amount of disk used:** 11.76 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}
```

#### unshuffled_original_oc

- **Size of downloaded dataset files:** 1.57 MB
- **Size of the generated dataset:** 6.12 MB
- **Total amount of disk used:** 7.71 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}
```

#### unshuffled_original_or

- **Size of downloaded dataset files:** 49.84 MB
- **Size of the generated dataset:** 260.15 MB
- **Total amount of disk used:** 309.99 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}
```

#### unshuffled_original_os

- **Size of downloaded dataset files:** 3.09 MB
- **Size of the generated dataset:** 12.90 MB
- **Total amount of disk used:** 15.99 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}
```

#### unshuffled_original_pa

- **Size of downloaded dataset files:** 164.21 MB
- **Size of the generated dataset:** 801.16 MB
- **Total amount of disk used:** 965.37 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}
```

#### unshuffled_original_pam

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}
```

#### unshuffled_original_pl

- **Size of downloaded dataset files:** 42.88 GB
- **Size of the generated dataset:** 117.12 GB
- **Total amount of disk used:** 160.01 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}
```

#### unshuffled_original_pms

- **Size of downloaded dataset files:** 0.75 MB
- **Size of the generated dataset:** 2.15 MB
- **Total amount of disk used:** 2.92 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}
```

#### unshuffled_original_pnb

- **Size of downloaded dataset files:** 3.22 MB
- **Size of the generated dataset:** 12.04 MB
- **Total amount of disk used:** 15.26 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}
```

#### unshuffled_original_ps

- **Size of downloaded dataset files:** 103.66 MB
- **Size of the generated dataset:** 379.51 MB
- **Total amount of disk used:** 483.17 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}
```

#### unshuffled_original_pt

- **Size of downloaded dataset files:** 47.26 GB
- **Size of the generated dataset:** 132.64 GB
- **Total amount of disk used:** 179.89 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}
```

#### unshuffled_original_qu

- **Size of downloaded dataset files:** 0.02 MB
- **Size of the generated dataset:** 0.08 MB
- **Total amount of disk used:** 0.10 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}
```

#### unshuffled_original_rm

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}
```

#### unshuffled_original_ro

- **Size of downloaded dataset files:** 9.53 GB
- **Size of the generated dataset:** 26.87 GB
- **Total amount of disk used:** 36.40 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}
```

#### unshuffled_original_ru

- **Size of downloaded dataset files:** 319.76 GB
- **Size of the generated dataset:** 1241.63 GB
- **Total amount of disk used:** 1561.38 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}
```

#### unshuffled_original_sa

- **Size of downloaded dataset files:** 17.52 MB
- **Size of the generated dataset:** 97.06 MB
- **Total amount of disk used:** 114.58 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्‍प्रति समाप्‍ता अस्ति । तस्‍य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}
```

#### unshuffled_original_sah

- **Size of downloaded dataset files:** 9.08 MB
- **Size of the generated dataset:** 43.82 MB
- **Total amount of disk used:** 52.90 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}
```

#### unshuffled_original_scn

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
{
    "id": 0,
    "text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}
```

#### unshuffled_original_sd

- **Size of downloaded dataset files:** 90.62 MB
- **Size of the generated dataset:** 364.25 MB
- **Total amount of disk used:** 454.88 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}
```

#### unshuffled_original_sh

- **Size of downloaded dataset files:** 3.46 MB
- **Size of the generated dataset:** 25.84 MB
- **Total amount of disk used:** 29.30 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}
```

#### unshuffled_original_si

- **Size of downloaded dataset files:** 310.93 MB
- **Size of the generated dataset:** 1.47 GB
- **Total amount of disk used:** 1.78 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}
```

#### unshuffled_original_sk

- **Size of downloaded dataset files:** 3.71 GB
- **Size of the generated dataset:** 9.81 GB
- **Total amount of disk used:** 13.52 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}
```

#### unshuffled_original_sl

- **Size of downloaded dataset files:** 956.20 MB
- **Size of the generated dataset:** 2.68 GB
- **Total amount of disk used:** 3.63 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}
```

#### unshuffled_original_so

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.06 MB
- **Total amount of disk used:** 0.06 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}
```

#### unshuffled_original_sq

- **Size of downloaded dataset files:** 861.84 MB
- **Size of the generated dataset:** 2.44 GB
- **Total amount of disk used:** 3.30 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}
```

#### unshuffled_original_sr

- **Size of downloaded dataset files:** 1.08 GB
- **Size of the generated dataset:** 4.13 GB
- **Total amount of disk used:** 5.21 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}
```

#### unshuffled_original_su

- **Size of downloaded dataset files:** 0.06 MB
- **Size of the generated dataset:** 0.23 MB
- **Total amount of disk used:** 0.28 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}
```

#### unshuffled_original_sv

- **Size of downloaded dataset files:** 17.18 GB
- **Size of the generated dataset:** 47.00 GB
- **Total amount of disk used:** 64.18 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}
```

#### unshuffled_original_sw

- **Size of downloaded dataset files:** 3.71 MB
- **Size of the generated dataset:** 14.07 MB
- **Total amount of disk used:** 17.78 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}
```

#### unshuffled_original_ta

- **Size of downloaded dataset files:** 1.74 GB
- **Size of the generated dataset:** 9.93 GB
- **Total amount of disk used:** 11.67 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}
```

#### unshuffled_original_te

- **Size of downloaded dataset files:** 522.47 MB
- **Size of the generated dataset:** 2.61 GB
- **Total amount of disk used:** 3.13 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}
```

#### unshuffled_original_tg

- **Size of downloaded dataset files:** 90.97 MB
- **Size of the generated dataset:** 397.43 MB
- **Total amount of disk used:** 488.41 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}
```

#### unshuffled_original_th

- **Size of downloaded dataset files:** 7.38 GB
- **Size of the generated dataset:** 38.29 GB
- **Total amount of disk used:** 45.67 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}
```

#### unshuffled_original_tk

- **Size of downloaded dataset files:** 2.96 MB
- **Size of the generated dataset:** 10.66 MB
- **Total amount of disk used:** 13.62 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}
```

#### unshuffled_original_tl

- **Size of downloaded dataset files:** 204.89 MB
- **Size of the generated dataset:** 606.30 MB
- **Total amount of disk used:** 811.19 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}
```

#### unshuffled_original_tr

- **Size of downloaded dataset files:** 21.96 GB
- **Size of the generated dataset:** 63.58 GB
- **Total amount of disk used:** 85.54 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}
```

#### unshuffled_original_tt

- **Size of downloaded dataset files:** 151.06 MB
- **Size of the generated dataset:** 703.42 MB
- **Total amount of disk used:** 854.47 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}
```

#### unshuffled_original_tyv

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.01 MB
- **Total amount of disk used:** 0.01 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}
```

#### unshuffled_original_ug

- **Size of downloaded dataset files:** 27.92 MB
- **Size of the generated dataset:** 127.42 MB
- **Total amount of disk used:** 155.35 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}
```

#### unshuffled_original_uk

- **Size of downloaded dataset files:** 14.42 GB
- **Size of the generated dataset:** 56.44 GB
- **Total amount of disk used:** 70.86 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}
```

#### unshuffled_original_ur

- **Size of downloaded dataset files:** 712.61 MB
- **Size of the generated dataset:** 2.80 GB
- **Total amount of disk used:** 3.51 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}
```

#### unshuffled_original_uz

- **Size of downloaded dataset files:** 5.78 MB
- **Size of the generated dataset:** 21.46 MB
- **Total amount of disk used:** 27.24 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}
```

#### unshuffled_original_vec

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.02 MB
- **Total amount of disk used:** 0.03 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}
```

#### unshuffled_original_vi

- **Size of downloaded dataset files:** 21.50 GB
- **Size of the generated dataset:** 72.23 GB
- **Total amount of disk used:** 93.73 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}
```

#### unshuffled_original_vo

- **Size of downloaded dataset files:** 0.30 MB
- **Size of the generated dataset:** 2.12 MB
- **Total amount of disk used:** 2.42 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}
```

#### unshuffled_original_wa

- **Size of downloaded dataset files:** 0.09 MB
- **Size of the generated dataset:** 0.29 MB
- **Total amount of disk used:** 0.38 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}
```

#### unshuffled_original_war

- **Size of downloaded dataset files:** 0.64 MB
- **Size of the generated dataset:** 2.68 MB
- **Total amount of disk used:** 3.32 MB

An example of 'train' looks as follows.
```
{
    "id": 1,
    "text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}
```

#### unshuffled_original_wuu

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.12 MB
- **Total amount of disk used:** 0.13 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}
```

#### unshuffled_original_xal

- **Size of downloaded dataset files:** 0.03 MB
- **Size of the generated dataset:** 0.12 MB
- **Total amount of disk used:** 0.15 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}
```

#### unshuffled_original_xmf

- **Size of downloaded dataset files:** 1.05 MB
- **Size of the generated dataset:** 6.12 MB
- **Total amount of disk used:** 7.17 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}
```

#### unshuffled_original_yi

- **Size of downloaded dataset files:** 33.33 MB
- **Size of the generated dataset:** 147.60 MB
- **Total amount of disk used:** 180.94 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}
```

#### unshuffled_original_yo

- **Size of downloaded dataset files:** 0.01 MB
- **Size of the generated dataset:** 0.06 MB
- **Total amount of disk used:** 0.06 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}
```

#### unshuffled_original_yue

- **Size of downloaded dataset files:** 0.00 MB
- **Size of the generated dataset:** 0.00 MB
- **Total amount of disk used:** 0.00 MB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}
```

#### unshuffled_original_zh

- **Size of downloaded dataset files:** 206.00 GB
- **Size of the generated dataset:** 545.61 GB
- **Total amount of disk used:** 751.61 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": 1,
    "text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}
```

</details>

### Data Fields

The data fields are the same among all configs.

- `id`: an `int64` feature.
- `text`: a `string` feature.

### Data Splits

<details>
<summary>Click to expand the number of samples per configuration</summary>

| Language | Language code | Name original | Train original | Words original | Size original | Name deduplicated | Train deduplicated | Words deduplicated | Size deduplicated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Afrikaans | af | unshuffled_original_af | 201117 | 43,482,801 | 241M | unshuffled_deduplicated_af | 130640 | 29,533,437 | 163M |
| Albanian | sq | unshuffled_original_sq | 672077 | 374,196,110 | 2.3G | unshuffled_deduplicated_sq | 461598 | 186,856,699 | 1.2G |
| Alemannic | als | unshuffled_original_als | 7324 | 841,750 | 5.0M | unshuffled_deduplicated_als | 4518 | 459,001 | 2.8M |
| Amharic | am | unshuffled_original_am | 83663 | 28,301,601 | 360M | unshuffled_deduplicated_am | 43102 | 16,086,628 | 206M |
| Arabic | ar | unshuffled_original_ar | 16365602 | 8,117,162,828 | 82G | unshuffled_deduplicated_ar | 9006977 | 3,171,221,354 | 32G |
| Aragonese | an | unshuffled_original_an | 2449 | 52,896 | 1.3M | unshuffled_deduplicated_an | 2025 | 45,669 | 801K |
| Armenian | hy | unshuffled_original_hy | 659430 | 273,919,388 | 3.7G | unshuffled_deduplicated_hy | 396093 | 110,196,043 | 1.5G |
| Assamese | as | unshuffled_original_as | 14985 | 6,956,663 | 113M | unshuffled_deduplicated_as | 9212 | 4,366,570 | 71M |
| Asturian | ast | unshuffled_original_ast | 6999 | 381,005 | 2.4M | unshuffled_deduplicated_ast | 5343 | 325,237 | 2.0M |
| Avaric | av | unshuffled_original_av | 456 | 24,720 | 409K | unshuffled_deduplicated_av | 360 | 19,478 | 324K |
| Azerbaijani | az | unshuffled_original_az | 912330 | 322,641,710 | 2.8G | unshuffled_deduplicated_az | 626796 | 167,742,296 | 1.5G |
| Bashkir | ba | unshuffled_original_ba | 42551 | 9,796,764 | 128M | unshuffled_deduplicated_ba | 27050 | 6,922,589 | 90M |
| Basque | eu | unshuffled_original_eu | 506883 | 120,456,652 | 848M | unshuffled_deduplicated_eu | 256513 | 45,359,710 | 342M |
| Bavarian | bar | unshuffled_original_bar | 4 | 399 | 503 | unshuffled_deduplicated_bar | 4 | 399 | 503 |
| Belarusian | be | unshuffled_original_be | 586031 | 144,579,630 | 1.8G | unshuffled_deduplicated_be | 307405 | 83,499,037 | 1.1G |
| Bengali | bn | unshuffled_original_bn | 1675515 | 623,575,733 | 11G | unshuffled_deduplicated_bn | 1114481 | 363,766,143 | 5.8G |
| Bihari | bh | unshuffled_original_bh | 336 | 8,848 | 110K | unshuffled_deduplicated_bh | 82 | 2,875 | 34K |
| Bishnupriya | bpy | unshuffled_original_bpy | 6046 | 198,286 | 4.1M | unshuffled_deduplicated_bpy | 1770 | 96,940 | 1.7M |
| Bosnian | bs | unshuffled_original_bs | 2143 | 106,448 | 447K | unshuffled_deduplicated_bs | 702 | 20,485 | 116K |
| Breton | br | unshuffled_original_br | 37085 | 5,013,241 | 29M | unshuffled_deduplicated_br | 14724 | 2,890,384 | 16M |
| Bulgarian | bg | unshuffled_original_bg | 5869686 | 2,947,648,106 | 32G | unshuffled_deduplicated_bg | 3398679 | 1,268,114,977 | 14G |
| Burmese | my | unshuffled_original_my | 232329 | 56,111,184 | 1.9G | unshuffled_deduplicated_my | 136639 | 30,102,173 | 1.1G |
| Catalan | ca | unshuffled_original_ca | 4390754 | 1,360,212,450 | 8.0G | unshuffled_deduplicated_ca | 2458067 | 729,333,440 | 4.3G |
| Cebuano | ceb | unshuffled_original_ceb | 56248 | 6,603,567 | 39M | unshuffled_deduplicated_ceb | 26145 | 3,675,024 | 24M |
| Central Bikol | bcl | unshuffled_original_bcl | 1 | 312 | 885 | unshuffled_deduplicated_bcl | 1 | 312 | 885 |
| Central Khmer | km | unshuffled_original_km | 159363 | 20,690,610 | 1.1G | unshuffled_deduplicated_km | 108346 | 10,082,245 | 581M |
| Central Kurdish | ckb | unshuffled_original_ckb | 103639 | 48,478,334 | 487M | unshuffled_deduplicated_ckb | 68210 | 18,726,721 | 226M |
| Chavacano | cbk | unshuffled_original_cbk | 1 | 130 | 520 | unshuffled_deduplicated_cbk | 1 | 130 | 520 |
| Chechen | ce | unshuffled_original_ce | 4042 | 711,051 | 8.3M | unshuffled_deduplicated_ce | 2984 | 568,146 | 6.7M |
| Chinese | zh | unshuffled_original_zh | 60137667 | 14,986,424,850 | 508G | unshuffled_deduplicated_zh | 41708901 | 6,350,215,113 | 249G |
| Chuvash | cv | unshuffled_original_cv | 20281 | 3,041,614 | 39M | unshuffled_deduplicated_cv | 10130 | 2,054,810 | 26M |
| Cornish | kw | unshuffled_original_kw | 203 | 8,329 | 44K | unshuffled_deduplicated_kw | 68 | 2,704 | 14K |
| Croatian | hr | unshuffled_original_hr | 582219 | 34,232,765 | 226M | unshuffled_deduplicated_hr | 321484 | 16,727,640 | 110M |
| Czech | cs | unshuffled_original_cs | 21001388 | 7,715,977,441 | 53G | unshuffled_deduplicated_cs | 12308039 | 3,540,997,509 | 24G |
| Danish | da | unshuffled_original_da | 7664010 | 2,637,463,889 | 16G | unshuffled_deduplicated_da | 4771098 | 1,620,091,317 | 9.5G |
| Dhivehi | dv | unshuffled_original_dv | 21018 | 7,559,472 | 126M | unshuffled_deduplicated_dv | 17024 | 4,726,660 | 79M |
| Dimli | diq | unshuffled_original_diq | 1 | 19 | 146 | unshuffled_deduplicated_diq | 1 | 19 | 146 |
| Dutch | nl | unshuffled_original_nl | 34682142 | 13,020,136,373 | 78G | unshuffled_deduplicated_nl | 20812149 | 6,598,786,137 | 39G |
| Eastern Mari | mhr | unshuffled_original_mhr | 3212 | 565,992 | 7.2M | unshuffled_deduplicated_mhr | 2515 | 469,297 | 6.0M |
| Egyptian Arabic | arz | unshuffled_original_arz | 158113 | 7,305,151 | 66M | unshuffled_deduplicated_arz | 79928 | 3,659,419 | 33M |
| Emilian-Romagnol | eml | unshuffled_original_eml | 84 | 6,376 | 25K | unshuffled_deduplicated_eml | 80 | 6,121 | 24K |
| English | en | unshuffled_original_en | 455994980 | 418,187,793,408 | 2.3T | unshuffled_deduplicated_en | 304230423 | 215,841,256,971 | 1.2T |
| Erzya | myv | unshuffled_original_myv | 6 | 90 | 1.4K | unshuffled_deduplicated_myv | 5 | 78 | 1.2K |
| Esperanto | eo | unshuffled_original_eo | 121171 | 48,486,161 | 299M | unshuffled_deduplicated_eo | 84752 | 37,324,446 | 228M |
| Estonian | et | unshuffled_original_et | 2093621 | 643,163,730 | 4.8G | unshuffled_deduplicated_et | 1172041 | 309,931,463 | 2.3G |
| Finnish | fi | unshuffled_original_fi | 8557453 | 3,196,666,419 | 27G | unshuffled_deduplicated_fi | 5326443 | 1,597,855,468 | 13G |
| French | fr | unshuffled_original_fr | 96742378 | 46,896,036,417 | 282G | unshuffled_deduplicated_fr | 59448891 | 23,206,776,649 | 138G |
| Galician | gl | unshuffled_original_gl | 544388 | 102,011,291 | 620M | unshuffled_deduplicated_gl | 284320 | 63,600,602 | 384M |
| Georgian | ka | unshuffled_original_ka | 563916 | 171,950,621 | 3.6G | unshuffled_deduplicated_ka | 372158 | 91,569,739 | 1.9G |
| German | de | unshuffled_original_de | 104913504 | 44,878,908,446 | 308G | unshuffled_deduplicated_de | 62398034 | 21,529,164,172 | 145G |
| Goan Konkani | gom | unshuffled_original_gom | 640 | 124,277 | 2.2M | unshuffled_deduplicated_gom | 484 | 102,306 | 1.8M |
| Guarani | gn | unshuffled_original_gn | 106 | 7,382 | 36K | unshuffled_deduplicated_gn | 68 | 4,680 | 24K |
| Gujarati | gu | unshuffled_original_gu | 240691 | 72,045,701 | 1.1G | unshuffled_deduplicated_gu | 169834 | 50,023,432 | 722M |
| Haitian | ht | unshuffled_original_ht | 13 | 1,014 | 3.9K | unshuffled_deduplicated_ht | 9 | 832 | 3.3K |
| Hebrew | he | unshuffled_original_he | 3808397 | 2,067,753,528 | 20G | unshuffled_deduplicated_he | 2375030 | 1,032,018,056 | 9.8G |
| Hindi | hi | unshuffled_original_hi | 3264660 | 1,372,234,782 | 17G | unshuffled_deduplicated_hi | 1909387 | 745,774,934 | 8.9G |
| Hungarian | hu | unshuffled_original_hu | 11197780 | 5,163,936,345 | 40G | unshuffled_deduplicated_hu | 6582908 | 2,339,127,555 | 18G |
| Icelandic | is | unshuffled_original_is | 625673 | 219,900,094 | 1.5G | unshuffled_deduplicated_is | 389515 | 129,818,331 | 846M |
| Ido | io | unshuffled_original_io | 694 | 25,702 | 147K | unshuffled_deduplicated_io | 617 | 22,773 | 130K |
| Iloko | ilo | unshuffled_original_ilo | 2638 | 142,942 | 874K | unshuffled_deduplicated_ilo | 1578 | 105,564 | 636K |
| Indonesian | id | unshuffled_original_id | 16236463 | 4,574,692,265 | 30G | unshuffled_deduplicated_id | 9948521 | 2,394,957,629 | 16G |
| Interlingua | ia | unshuffled_original_ia | 1040 | 180,231 | 662K | unshuffled_deduplicated_ia | 529 | 100,019 | 360K |
| Interlingue | ie | unshuffled_original_ie | 101 | 5,352 | 24K | unshuffled_deduplicated_ie | 11 | 602 | 1.6K |
| Irish | ga | unshuffled_original_ga | 83223 | 14,483,593 | 88M | unshuffled_deduplicated_ga | 46493 | 10,017,303 | 60M |
| Italian | it | unshuffled_original_it | 46981781 | 22,248,707,341 | 137G | unshuffled_deduplicated_it | 28522082 | 11,250,012,896 | 69G |
| Japanese | ja | unshuffled_original_ja | 62721527 | 4,962,979,182 | 216G | unshuffled_deduplicated_ja | 39496439 | 1,123,067,063 | 106G |
| Javanese | jv | unshuffled_original_jv | 1445 | 104,896 | 659K | unshuffled_deduplicated_jv | 1163 | 86,654 | 583K |
| Kalmyk | xal | unshuffled_original_xal | 39 | 10,277 | 113K | unshuffled_deduplicated_xal | 36 | 10,155 | 112K |
| Kannada | kn | unshuffled_original_kn | 350363 | 81,186,863 | 1.7G | unshuffled_deduplicated_kn | 251064 | 49,343,462 | 1.1G |
| Karachay-Balkar | krc | unshuffled_original_krc | 1581 | 185,436 | 2.6M | unshuffled_deduplicated_krc | 1377 | 166,496 | 2.3M |
| Kazakh | kk | unshuffled_original_kk | 524591 | 191,126,469 | 2.7G | unshuffled_deduplicated_kk | 338073 | 108,388,743 | 1.5G |
| Kirghiz | ky | unshuffled_original_ky | 146993 | 44,194,823 | 600M | unshuffled_deduplicated_ky | 86561 | 28,982,620 | 388M |
| Komi | kv | unshuffled_original_kv | 1549 | 201,404 | 2.3M | unshuffled_deduplicated_kv | 924 | 95,243 | 1.2M |
| Korean | ko | unshuffled_original_ko | 7345075 | 2,368,765,142 | 24G | unshuffled_deduplicated_ko | 3675420 | 1,120,375,149 | 12G |
| Kurdish | ku | unshuffled_original_ku | 46535 | 15,561,003 | 94M | unshuffled_deduplicated_ku | 29054 | 9,946,440 | 60M |
| Lao | lo | unshuffled_original_lo | 52910 | 4,133,311 | 174M | unshuffled_deduplicated_lo | 32652 | 2,583,342 | 114M |
| Latin | la | unshuffled_original_la | 94588 | 4,122,201 | 26M | unshuffled_deduplicated_la | 18808 | 1,328,038 | 8.3M |
| Latvian | lv | unshuffled_original_lv | 1593820 | 520,761,977 | 4.0G | unshuffled_deduplicated_lv | 843195 | 236,428,905 | 1.8G |
| Lezghian | lez | unshuffled_original_lez | 1485 | 247,646 | 3.3M | unshuffled_deduplicated_lez | 1381 | 224,871 | 3.0M |
| Limburgan | li | unshuffled_original_li | 137 | 4,730 | 29K | unshuffled_deduplicated_li | 118 | 4,283 | 27K |
| Lithuanian | lt | unshuffled_original_lt | 2977757 | 1,159,661,742 | 8.8G | unshuffled_deduplicated_lt | 1737411 | 516,183,525 | 3.9G |
| Lojban | jbo | unshuffled_original_jbo | 832 | 154,330 | 736K | unshuffled_deduplicated_jbo | 617 | 141,973 | 678K |
| Lombard | lmo | unshuffled_original_lmo | 1401 | 75,229 | 443K | unshuffled_deduplicated_lmo | 1374 | 73,665 | 433K |
| Low German | nds | unshuffled_original_nds | 18174 | 2,906,347 | 18M | unshuffled_deduplicated_nds | 8714 | 2,146,417 | 13M |
| Lower Sorbian | dsb | unshuffled_original_dsb | 65 | 1,787 | 13K | unshuffled_deduplicated_dsb | 37 | 966 | 7.1K |
| Luxembourgish | lb | unshuffled_original_lb | 34807 | 4,403,577 | 29M | unshuffled_deduplicated_lb | 21735 | 3,087,650 | 21M |
| Macedonian | mk | unshuffled_original_mk | 437871 | 189,289,873 | 2.1G | unshuffled_deduplicated_mk | 299457 | 102,849,595 | 1.2G |
| Maithili | mai | unshuffled_original_mai | 123 | 69,161 | 317K | unshuffled_deduplicated_mai | 25 | 874 | 11K |
| Malagasy | mg | unshuffled_original_mg | 17957 | 3,068,360 | 21M | unshuffled_deduplicated_mg | 13343 | 1,872,044 | 13M |
| Malay | ms | unshuffled_original_ms | 534016 | 16,696,882 | 111M | unshuffled_deduplicated_ms | 183443 | 6,045,753 | 42M |
| Malayalam | ml | unshuffled_original_ml | 603937 | 189,534,472 | 4.9G | unshuffled_deduplicated_ml | 453904 | 95,892,551 | 2.5G |
| Maltese | mt | unshuffled_original_mt | 26598 | 2,995,654 | 24M | unshuffled_deduplicated_mt | 16383 | 2,163,358 | 17M |
| Marathi | mr | unshuffled_original_mr | 326804 | 162,609,404 | 2.7G | unshuffled_deduplicated_mr | 212556 | 82,130,803 | 1.4G |
| Mazanderani | mzn | unshuffled_original_mzn | 1055 | 73,870 | 691K | unshuffled_deduplicated_mzn | 917 | 64,481 | 602K |
| Minangkabau | min | unshuffled_original_min | 220 | 5,682 | 608K | unshuffled_deduplicated_min | 166 | 4,825 | 310K |
| Mingrelian | xmf | unshuffled_original_xmf | 3783 | 299,098 | 5.8M | unshuffled_deduplicated_xmf | 2418 | 228,629 | 4.4M |
| Mirandese | mwl | unshuffled_original_mwl | 8 | 171 | 1.2K | unshuffled_deduplicated_mwl | 7 | 152 | 1.1K |
| Modern Greek | el | unshuffled_original_el | 10425596 | 5,479,180,137 | 62G | unshuffled_deduplicated_el | 6521169 | 2,412,419,435 | 27G |
| Mongolian | mn | unshuffled_original_mn | 395605 | 181,307,167 | 2.2G | unshuffled_deduplicated_mn | 197878 | 68,362,013 | 838M |
| Nahuatl languages | nah | unshuffled_original_nah | 61 | 1,234 | 12K | unshuffled_deduplicated_nah | 58 | 1,193 | 11K |
| Neapolitan | nap | unshuffled_original_nap | 73 | 5,282 | 17K | unshuffled_deduplicated_nap | 55 | 4,147 | 13K |
| Nepali | ne | unshuffled_original_ne | 299938 | 107,448,208 | 1.8G | unshuffled_deduplicated_ne | 219334 | 71,628,317 | 1.2G |
| Newari | new | unshuffled_original_new | 4696 | 564,697 | 5.5M | unshuffled_deduplicated_new | 2126 | 288,995 | 4.1M |
| Northern Frisian | frr | unshuffled_original_frr | 7 | 1,516 | 4.4K | unshuffled_deduplicated_frr | 7 | 1,516 | 4.4K |
| Northern Luri | lrc | unshuffled_original_lrc | 88 | 
8,022 | 76K | unshuffled_deduplicated_lrc | 72 | 6,740 | 63K | | Norwegian | no | unshuffled_original_no | 5546211 | 1,344,326,388 | 8.0G | unshuffled_deduplicated_no | 3229940 | 804,894,377 | 4.7G | | Norwegian Nynorsk | nn | unshuffled_original_nn | 185884 | 14,764,980 | 85M | unshuffled_deduplicated_nn | 109118 | 9,435,139 | 54M | | Occitan | oc | unshuffled_original_oc | 10709 | 750,301 | 5.8M | unshuffled_deduplicated_oc | 6485 | 512,678 | 3.7M | | Oriya | or | unshuffled_original_or | 59463 | 14,938,567 | 248M | unshuffled_deduplicated_or | 44230 | 11,321,740 | 188M | | Ossetian | os | unshuffled_original_os | 5213 | 1,031,268 | 13M | unshuffled_deduplicated_os | 2559 | 878,765 | 11M | | Pampanga | pam | unshuffled_original_pam | 3 | 130 | 760 | unshuffled_deduplicated_pam | 1 | 52 | 304 | | Panjabi | pa | unshuffled_original_pa | 127467 | 61,847,806 | 763M | unshuffled_deduplicated_pa | 87235 | 37,555,835 | 460M | | Persian | fa | unshuffled_original_fa | 13704702 | 9,096,554,121 | 79G | unshuffled_deduplicated_fa | 8203495 | 4,363,505,319 | 38G | | Piemontese | pms | unshuffled_original_pms | 3225 | 362,013 | 2.1M | unshuffled_deduplicated_pms | 2859 | 337,246 | 1.9M | | Polish | pl | unshuffled_original_pl | 35440972 | 15,277,255,137 | 109G | unshuffled_deduplicated_pl | 20682611 | 6,708,709,674 | 47G | | Portuguese | pt | unshuffled_original_pt | 42114520 | 20,641,903,898 | 124G | unshuffled_deduplicated_pt | 26920397 | 10,751,156,918 | 64G | | Pushto | ps | unshuffled_original_ps | 98216 | 46,559,441 | 361M | unshuffled_deduplicated_ps | 67921 | 31,347,348 | 242M | | Quechua | qu | unshuffled_original_qu | 452 | 10,186 | 78K | unshuffled_deduplicated_qu | 411 | 8,691 | 67K | | Romanian | ro | unshuffled_original_ro | 9387265 | 3,984,317,058 | 25G | unshuffled_deduplicated_ro | 5044757 | 1,741,794,069 | 11G | | Romansh | rm | unshuffled_original_rm | 41 | 1,093 | 7.4K | unshuffled_deduplicated_rm | 34 | 960 | 6.5K | | Russia Buriat | bxr | 
unshuffled_original_bxr | 42 | 963 | 13K | unshuffled_deduplicated_bxr | 36 | 809 | 11K | | Russian | ru | unshuffled_original_ru | 161836003 | 92,522,407,837 | 1.2T | unshuffled_deduplicated_ru | 115954598 | 46,692,691,520 | 568G | | Sanskrit | sa | unshuffled_original_sa | 14291 | 4,331,569 | 93M | unshuffled_deduplicated_sa | 7121 | 1,713,930 | 37M | | Scottish Gaelic | gd | unshuffled_original_gd | 5799 | 310,689 | 1.9M | unshuffled_deduplicated_gd | 3883 | 207,110 | 1.3M | | Serbian | sr | unshuffled_original_sr | 1013619 | 364,395,411 | 3.9G | unshuffled_deduplicated_sr | 645747 | 207,561,168 | 2.2G | | Serbo-Croatian | sh | unshuffled_original_sh | 36700 | 5,292,184 | 25M | unshuffled_deduplicated_sh | 17610 | 1,040,573 | 5.8M | | Sicilian | scn | unshuffled_original_scn | 21 | 554 | 3.3K | unshuffled_deduplicated_scn | 17 | 468 | 2.8K | | Sindhi | sd | unshuffled_original_sd | 44280 | 43,530,158 | 347M | unshuffled_deduplicated_sd | 33925 | 33,028,015 | 263M | | Sinhala | si | unshuffled_original_si | 203082 | 93,053,465 | 1.4G | unshuffled_deduplicated_si | 120684 | 50,864,857 | 802M | | Slovak | sk | unshuffled_original_sk | 5492194 | 1,322,247,763 | 9.1G | unshuffled_deduplicated_sk | 2820821 | 656,346,179 | 4.5G | | Slovenian | sl | unshuffled_original_sl | 1746604 | 387,399,700 | 2.5G | unshuffled_deduplicated_sl | 886223 | 193,926,684 | 1.3G | | Somali | so | unshuffled_original_so | 156 | 1,202 | 61K | unshuffled_deduplicated_so | 42 | 472 | 16K | | South Azerbaijani | azb | unshuffled_original_azb | 15446 | 2,175,054 | 27M | unshuffled_deduplicated_azb | 9985 | 1,528,709 | 19M | | Spanish | es | unshuffled_original_es | 88199221 | 47,545,122,279 | 278G | unshuffled_deduplicated_es | 56326016 | 25,928,290,729 | 149G | | Sundanese | su | unshuffled_original_su | 805 | 30,321 | 211K | unshuffled_deduplicated_su | 511 | 20,278 | 141K | | Swahili | sw | unshuffled_original_sw | 41986 | 2,211,927 | 13M | unshuffled_deduplicated_sw | 24803 | 1,376,963 | 
8.1M | | Swedish | sv | unshuffled_original_sv | 17395625 | 7,155,994,312 | 44G | unshuffled_deduplicated_sv | 11014487 | 4,106,120,608 | 25G | | Tagalog | tl | unshuffled_original_tl | 458206 | 98,949,299 | 573M | unshuffled_deduplicated_tl | 294132 | 70,121,601 | 407M | | Tajik | tg | unshuffled_original_tg | 89002 | 31,758,142 | 379M | unshuffled_deduplicated_tg | 56259 | 21,029,893 | 249M | | Tamil | ta | unshuffled_original_ta | 1263280 | 420,537,132 | 9.3G | unshuffled_deduplicated_ta | 833101 | 226,013,330 | 5.1G | | Tatar | tt | unshuffled_original_tt | 135923 | 51,034,893 | 670M | unshuffled_deduplicated_tt | 82738 | 23,825,695 | 305M | | Telugu | te | unshuffled_original_te | 475703 | 123,711,517 | 2.5G | unshuffled_deduplicated_te | 312644 | 79,094,167 | 1.6G | | Thai | th | unshuffled_original_th | 6064129 | 951,743,087 | 36G | unshuffled_deduplicated_th | 3749826 | 368,965,202 | 16G | | Tibetan | bo | unshuffled_original_bo | 26795 | 1,483,589 | 187M | unshuffled_deduplicated_bo | 15762 | 936,556 | 138M | | Turkish | tr | unshuffled_original_tr | 18535253 | 7,577,388,700 | 60G | unshuffled_deduplicated_tr | 11596446 | 3,365,734,289 | 27G | | Turkmen | tk | unshuffled_original_tk | 6456 | 1,113,869 | 11M | unshuffled_deduplicated_tk | 4694 | 752,326 | 6.8M | | Tuvinian | tyv | unshuffled_original_tyv | 34 | 759 | 12K | unshuffled_deduplicated_tyv | 24 | 540 | 7.9K | | Uighur | ug | unshuffled_original_ug | 22255 | 8,657,141 | 122M | unshuffled_deduplicated_ug | 15503 | 5,852,225 | 83M | | Ukrainian | uk | unshuffled_original_uk | 12973467 | 4,204,381,276 | 53G | unshuffled_deduplicated_uk | 7782375 | 2,252,380,351 | 28G | | Upper Sorbian | hsb | unshuffled_original_hsb | 7959 | 545,351 | 4.2M | unshuffled_deduplicated_hsb | 3084 | 236,867 | 1.8M | | Urdu | ur | unshuffled_original_ur | 638596 | 331,817,982 | 2.7G | unshuffled_deduplicated_ur | 428674 | 218,030,228 | 1.7G | | Uzbek | uz | unshuffled_original_uz | 27537 | 2,450,256 | 21M | 
unshuffled_deduplicated_uz | 15074 | 1,381,644 | 12M | | Venetian | vec | unshuffled_original_vec | 73 | 3,492 | 18K | unshuffled_deduplicated_vec | 64 | 3,199 | 17K | | Vietnamese | vi | unshuffled_original_vi | 14898250 | 12,036,845,359 | 68G | unshuffled_deduplicated_vi | 9897709 | 5,577,159,843 | 32G | | Volapük | vo | unshuffled_original_vo | 3366 | 321,121 | 2.0M | unshuffled_deduplicated_vo | 3317 | 318,568 | 2.0M | | Walloon | wa | unshuffled_original_wa | 1001 | 50,720 | 273K | unshuffled_deduplicated_wa | 677 | 37,543 | 203K | | Waray | war | unshuffled_original_war | 9760 | 397,315 | 2.5M | unshuffled_deduplicated_war | 9161 | 336,311 | 2.2M | | Welsh | cy | unshuffled_original_cy | 157698 | 37,422,441 | 213M | unshuffled_deduplicated_cy | 98225 | 23,574,673 | 133M | | Western Frisian | fy | unshuffled_original_fy | 33053 | 5,691,077 | 35M | unshuffled_deduplicated_fy | 20661 | 4,223,816 | 26M | | Western Mari | mrj | unshuffled_original_mrj | 757 | 93,338 | 1.2M | unshuffled_deduplicated_mrj | 669 | 87,780 | 1.1M | | Western Panjabi | pnb | unshuffled_original_pnb | 4599 | 1,426,986 | 12M | unshuffled_deduplicated_pnb | 3463 | 1,111,112 | 9.0M | | Wu Chinese | wuu | unshuffled_original_wuu | 214 | 11,189 | 109K | unshuffled_deduplicated_wuu | 64 | 4,333 | 32K | | Yakut | sah | unshuffled_original_sah | 22301 | 2,547,623 | 42M | unshuffled_deduplicated_sah | 8555 | 1,789,174 | 26M | | Yiddish | yi | unshuffled_original_yi | 59364 | 13,834,320 | 141M | unshuffled_deduplicated_yi | 32919 | 8,212,970 | 84M | | Yoruba | yo | unshuffled_original_yo | 214 | 8,906 | 55K | unshuffled_deduplicated_yo | 49 | 3,518 | 27K | | Yue Chinese | yue | unshuffled_original_yue | 11 | 186 | 3.7K | unshuffled_deduplicated_yue | 7 | 128 | 2.2K | </details> ## Dataset Creation ### Curation Rationale OSCAR was constructed new pipeline derived from the [fastText's one](https://github.com/facebookresearch/fastText), called [_goclassy_](https://github.com/pjox/goclassy). 
Goclassy reuses the [fastText linear classifier](https://fasttext.cc) and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises their pipeline in an asynchronous manner.

The order of operations is more or less the same as in the fastText pre-processing pipeline, but instead of clustering multiple operations into a single blocking process, a worker is launched for each operation, with the number of parallel operations at any given time bounded by the number of available threads rather than the number of CPUs. Goclassy is implemented in the [Go programming language](https://golang.org/), so it lets the [Go runtime](https://golang.org/src/runtime/mprof.go) handle the scheduling of the processes. Thus, with goclassy's pipeline one does not have to wait for a whole WET file to download, decompress and classify before starting to download and process the next one; a new file will start downloading and processing as soon as the scheduler is able to allocate a new process.

Filtering and cleaning processes at line level are done before feeding each line to the classifier. Lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded and are not classified. After all files are processed, the deduplicated versions are constructed and everything is then split into shards and compressed.

### Source Data

#### Initial Data Collection and Normalization

[Common Crawl](https://commoncrawl.org/) is a non-profit foundation which produces and maintains an open repository of web crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files).
The organisation's crawlers have always respected [nofollow](http://microformats.org/wiki/rel-nofollow) and [robots.txt](https://www.robotstxt.org/) policies.

Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.

To construct OSCAR, the WET files of Common Crawl were used. These contain the extracted plain texts from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. In the case of OSCAR, the **November 2018** snapshot was used. It surpasses 20TB of uncompressed data and contains more than 50 thousand plain text files, where each file consists of the plain text from multiple websites along with its metadata header.

#### Who are the source language producers?

The data comes from multiple web pages in a large variety of languages.

### Annotations

The dataset does not contain any additional annotations.

#### Annotation process

N/A

#### Who are the annotators?

N/A

### Personal and Sensitive Information

Being constructed from Common Crawl, OSCAR may contain personal and sensitive information. This **must** be considered before training deep learning models with OSCAR, especially in the case of text-generation models.

## Considerations for Using the Data

### Social Impact of Dataset

OSCAR is intended to bring more data to a wide variety of languages; the aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.

### Discussion of Biases

OSCAR is not properly filtered yet, and this can be reflected in the models trained with it. Care is advised, especially concerning biases of the resulting models.
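
For reference, the only filtering this card describes is the line-level rule stated under Curation Rationale: lines shorter than 100 UTF-8 characters, or containing invalid UTF-8, are dropped before classification. The following is a minimal Python sketch of that rule only. It is not goclassy's code (goclassy is written in Go); the function names `keep_line` and `filter_lines` are hypothetical, and it assumes each line arrives as raw bytes with the trailing newline already removed.

```python
from typing import Iterable, Iterator

# Threshold stated in the card: lines shorter than 100 characters are dropped.
MIN_CHARS = 100


def keep_line(raw: bytes, min_chars: int = MIN_CHARS) -> bool:
    """Return True if a raw line would survive the described filter.

    Hypothetical helper: a line is kept only if it decodes as strict
    UTF-8 and has at least `min_chars` characters (code points, not bytes).
    """
    try:
        text = raw.decode("utf-8")  # strict decoding: invalid bytes raise
    except UnicodeDecodeError:
        return False
    return len(text) >= min_chars


def filter_lines(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the decoded text of every line that passes `keep_line`."""
    for raw in lines:
        if keep_line(raw):
            yield raw.decode("utf-8")
```

Note that the length is counted in characters rather than bytes, so a line of 100 accented letters passes even though it occupies 200 bytes. Whether goclassy counts code points in exactly this way is an assumption of the sketch.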
### Other Known Limitations

The [fastText linear classifier](https://fasttext.cc) is limited both in performance and in the variety of languages it can recognize, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been done by [third parties](https://arxiv.org/abs/2010.14571).

## Additional Information

### Dataset Curators

The corpus was put together by [Pedro J. Ortiz](https://pjortiz.eu/), [Benoît Sagot](http://pauillac.inria.fr/~sagot/), and [Laurent Romary](https://cv.archives-ouvertes.fr/laurentromary), during work done at [Inria](https://www.inria.fr/en), particularly at the [ALMAnaCH team](https://team.inria.fr/almanach/).

### Licensing Information

These data are released under this licensing scheme:

We do not own any of the text from which these data have been extracted.

We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/

To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.

This work is published from: France.

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
### Citation Information

```
@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier and Romary, Laurent and Sagot, Benoit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

@inproceedings{OrtizSuarezSagotRomary2019,
    author = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
    title = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
    series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
    editor = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
    publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
    address = {Mannheim},
    doi = {10.14618/ids-pub-9021},
    url = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
    pages = {9 -- 16},
    year = {2019},
    abstract = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
    language = {en}
}
```

### Contributions

Thanks to [@pjox](https://github.com/pjox) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.
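
The configuration names used throughout this card follow a visible pattern: `unshuffled_original_<code>` and `unshuffled_deduplicated_<code>`, where `<code>` is the language code (for example `unshuffled_deduplicated_af`). As a small illustration, the hypothetical helper below builds such a name. The function `oscar_config_name` is not part of any library, and the validation rule (lowercase ASCII letters only) is an assumption based on the codes listed in this card.

```python
def oscar_config_name(lang_code: str, deduplicated: bool = True) -> str:
    """Build a configuration name following the pattern seen in this card.

    Hypothetical helper: the pattern is inferred from names such as
    `unshuffled_original_en` and `unshuffled_deduplicated_en`.
    """
    code = lang_code.strip()
    # Assumption: language codes are short lowercase ASCII strings
    # such as "af", "als" or "no".
    if not (code and code.isascii() and code.isalpha() and code.islower()):
        raise ValueError(f"unexpected language code: {lang_code!r}")
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{code}"


if __name__ == "__main__":
    # The resulting string is the kind of value one would pass as the
    # configuration name to a dataset loader; no loading is done here.
    print(oscar_config_name("af"))
    print(oscar_config_name("myv", deduplicated=False))
```

The helper only formats a string and does not check that the language actually exists in the corpus, so the per-language table remains the reference for which configurations are available.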
Provider:
oscar-corpus
Original information summary

Dataset overview

Dataset name

  • pretty_name: OSCAR

Language and creator information

  • annotations_creators: no-annotation
  • language_creators: found
  • language: multilingual, including but not limited to af, als, am, an, ar, arz, as, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, cs, cv, cy, da, de, diq, dsb, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, he, hi, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pl, pms, pnb, ps, pt, qu, rm, ro, ru, sa, sah, scn, sd, sh, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vi, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh

License

  • license: cc0-1.0

Multilinguality

  • multilinguality: multilingual

Size categories

  • size_categories:
    • 100K<n<1M
    • 100M<n<1B
    • 10K<n<100K
    • 10M<n<100M
    • 1K<n<10K
    • 1M<n<10M
    • n<1K

Source datasets

  • source_datasets: original

Task categories

  • task_categories:
    • text-generation
    • fill-mask

Task IDs

  • task_ids:
    • language-modeling
    • masked-language-modeling

Papers with Code ID

  • paperswithcode_id: oscar

Detailed dataset configurations

Config name: unshuffled_deduplicated_af

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 171320914
      • num_examples: 130640
    • download_size: 65989254
    • dataset_size: 171320914

Config name: unshuffled_deduplicated_als

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 2915912
      • num_examples: 4518
    • download_size: 1263294
    • dataset_size: 2915912

Config name: unshuffled_deduplicated_arz

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 34893248
      • num_examples: 79928
    • download_size: 10027493
    • dataset_size: 34893248

Config name: unshuffled_deduplicated_an

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 842246
      • num_examples: 2025
    • download_size: 133373
    • dataset_size: 842246

Config name: unshuffled_deduplicated_ast

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 2150022
      • num_examples: 5343
    • download_size: 856177
    • dataset_size: 2150022

Config name: unshuffled_deduplicated_ba

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 93623739
      • num_examples: 27050
    • download_size: 25983491
    • dataset_size: 93623739

Config name: unshuffled_deduplicated_am

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 215618603
      • num_examples: 43102
    • download_size: 61347279
    • dataset_size: 215618603

Config name: unshuffled_deduplicated_as

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 73989818
      • num_examples: 9212
    • download_size: 15513004
    • dataset_size: 73989818

Config name: unshuffled_deduplicated_azb

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 20001183
      • num_examples: 9985
    • download_size: 5191704
    • dataset_size: 20001183

Config name: unshuffled_deduplicated_be

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 1077152244
      • num_examples: 307405
    • download_size: 306700943
    • dataset_size: 1077152244

Config name: unshuffled_deduplicated_bo

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 144506264
      • num_examples: 15762
    • download_size: 22365048
    • dataset_size: 144506264

Config name: unshuffled_deduplicated_bxr

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 11325
      • num_examples: 36
    • download_size: 3666
    • dataset_size: 11325

Config name: unshuffled_deduplicated_ceb

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 24439249
      • num_examples: 26145
    • download_size: 7124786
    • dataset_size: 24439249

Config name: unshuffled_deduplicated_az

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 1526935070
      • num_examples: 626796
    • download_size: 521744076
    • dataset_size: 1526935070

Config name: unshuffled_deduplicated_bcl

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 900
      • num_examples: 1
    • download_size: 594
    • dataset_size: 900

Config name: unshuffled_deduplicated_cy

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 140412555
      • num_examples: 98225
    • download_size: 53629697
    • dataset_size: 140412555

Config name: unshuffled_deduplicated_dsb

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 7589
      • num_examples: 37
    • download_size: 3640
    • dataset_size: 7589

Config name: unshuffled_deduplicated_bn

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 6233041155
      • num_examples: 1114481
    • download_size: 1257218381
    • dataset_size: 6233041155

Config name: unshuffled_deduplicated_bs

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 125977
      • num_examples: 702
    • download_size: 38669
    • dataset_size: 125977

Config name: unshuffled_deduplicated_ce

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 7021674
      • num_examples: 2984
    • download_size: 1862792
    • dataset_size: 7021674

Config name: unshuffled_deduplicated_cv

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 27359554
      • num_examples: 10130
    • download_size: 7461982
    • dataset_size: 27359554

Config name: unshuffled_deduplicated_diq

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 161
      • num_examples: 1
    • download_size: 331
    • dataset_size: 161

Config name: unshuffled_deduplicated_eml

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 24657
      • num_examples: 80
    • download_size: 10055
    • dataset_size: 24657

Config name: unshuffled_deduplicated_et

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 2434152666
      • num_examples: 1172041
    • download_size: 966785545
    • dataset_size: 2434152666

Config name: unshuffled_deduplicated_bg

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 14420684170
      • num_examples: 3398679
    • download_size: 3848659853
    • dataset_size: 14420684170

Config name: unshuffled_deduplicated_bpy

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 1725535
      • num_examples: 1770
    • download_size: 191472
    • dataset_size: 1725535

Config name: unshuffled_deduplicated_ca

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 4544123629
      • num_examples: 2458067
    • download_size: 1734548117
    • dataset_size: 4544123629

Config name: unshuffled_deduplicated_ckb

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 237229156
      • num_examples: 68210
    • download_size: 60319928
    • dataset_size: 237229156

Config name: unshuffled_deduplicated_ar

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 33468271639
      • num_examples: 9006977
    • download_size: 9667185012
    • dataset_size: 33468271639

Config name: unshuffled_deduplicated_av

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 334755
      • num_examples: 360
    • download_size: 75341
    • dataset_size: 334755

Config name: unshuffled_deduplicated_bar

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 551
      • num_examples: 4
    • download_size: 354
    • dataset_size: 551

Config name: unshuffled_deduplicated_bh

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 35216
      • num_examples: 82
    • download_size: 6003
    • dataset_size: 35216

Config name: unshuffled_deduplicated_br

  • features:
    • id: int64
    • text: string
  • splits:
    • train
      • num_bytes: 16712284
      • num_examples: 14724
    • download_size: 646806
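Summary
Dataset introduction
Background and challenges
Background overview
OSCAR is a very large multilingual corpus built from Common Crawl, containing original and deduplicated versions for 166 languages. The dataset is mainly used for pre-training language models and word representations; note, however, that copyright in its content belongs to the original authors and that access is currently restricted.
The above content was collected and summarized by 遇见数据集.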