Polygl0t/mc4-50k
Source: Hugging Face. Updated 2026-01-07; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Polygl0t/mc4-50k
Resource description:
---
dataset_info:
- config_name: af
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 211272319
num_examples: 50000
download_size: 79220502
dataset_size: 211272319
- config_name: am
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 604626916
num_examples: 50000
download_size: 202248549
dataset_size: 604626916
- config_name: ar
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 362524922
num_examples: 50000
download_size: 117444521
dataset_size: 362524922
- config_name: az
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 242819371
num_examples: 50000
download_size: 87352644
dataset_size: 242819371
- config_name: be
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 415391967
num_examples: 50000
download_size: 138308460
dataset_size: 415391967
- config_name: bg
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 365917287
num_examples: 50000
download_size: 118076336
dataset_size: 365917287
- config_name: bg-Latn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 165078549
num_examples: 50000
download_size: 51304732
dataset_size: 165078549
- config_name: bn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 424341506
num_examples: 50000
download_size: 118072815
dataset_size: 424341506
- config_name: ca
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 244692891
num_examples: 50000
download_size: 92429243
dataset_size: 244692891
- config_name: ceb
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 165900131
num_examples: 50000
download_size: 54513013
dataset_size: 165900131
- config_name: co
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 107306338
num_examples: 50000
download_size: 36943992
dataset_size: 107306338
- config_name: cs
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 272769647
num_examples: 50000
download_size: 105198713
dataset_size: 272769647
- config_name: cy
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 378684171
num_examples: 50000
download_size: 126179213
dataset_size: 378684171
- config_name: da
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 284072138
num_examples: 50000
download_size: 104032126
dataset_size: 284072138
- config_name: de
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 266450698
num_examples: 50000
download_size: 99756703
dataset_size: 266450698
- config_name: el
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 369582896
num_examples: 50000
download_size: 120415765
dataset_size: 369582896
- config_name: el-Latn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 195392591
num_examples: 50000
download_size: 71301162
dataset_size: 195392591
- config_name: en
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 173659364
num_examples: 50000
download_size: 66082676
dataset_size: 173659364
- config_name: eo
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 362420313
num_examples: 50000
download_size: 137363886
dataset_size: 362420313
- config_name: es
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 298247914
num_examples: 50000
download_size: 111607368
dataset_size: 298247914
- config_name: et
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 280674107
num_examples: 50000
download_size: 107584098
dataset_size: 280674107
- config_name: eu
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 240249498
num_examples: 50000
download_size: 90080710
dataset_size: 240249498
- config_name: fa
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 337800853
num_examples: 50000
download_size: 105831457
dataset_size: 337800853
- config_name: fi
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 270098057
num_examples: 50000
download_size: 102294813
dataset_size: 270098057
- config_name: fil
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 268830088
num_examples: 50000
download_size: 98795600
dataset_size: 268830088
- config_name: fr
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 260400815
num_examples: 50000
download_size: 95793038
dataset_size: 260400815
- config_name: fy
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 89290945
num_examples: 50000
download_size: 31553197
dataset_size: 89290945
- config_name: ga
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 230304562
num_examples: 50000
download_size: 83584584
dataset_size: 230304562
- config_name: gd
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 277718144
num_examples: 50000
download_size: 96468338
dataset_size: 277718144
- config_name: gl
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 142526481
num_examples: 50000
download_size: 53491409
dataset_size: 142526481
- config_name: gu
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 466897688
num_examples: 50000
download_size: 133303578
dataset_size: 466897688
- config_name: ha
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 191051242
num_examples: 50000
download_size: 69375460
dataset_size: 191051242
- config_name: haw
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 212280095
num_examples: 50000
download_size: 75169396
dataset_size: 212280095
- config_name: hi
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 522236625
num_examples: 50000
download_size: 155438533
dataset_size: 522236625
- config_name: hi-Latn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 225029072
num_examples: 50000
download_size: 81096447
dataset_size: 225029072
- config_name: hmn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 200073318
num_examples: 50000
download_size: 67492426
dataset_size: 200073318
- config_name: ht
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 174761209
num_examples: 50000
download_size: 63357361
dataset_size: 174761209
- config_name: hu
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 292550440
num_examples: 50000
download_size: 110680340
dataset_size: 292550440
- config_name: hy
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 354822128
num_examples: 50000
download_size: 112086055
dataset_size: 354822128
- config_name: id
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 283107717
num_examples: 50000
download_size: 97527211
dataset_size: 283107717
- config_name: ig
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 213231308
num_examples: 50000
download_size: 71937237
dataset_size: 213231308
- config_name: is
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 314581102
num_examples: 50000
download_size: 116340097
dataset_size: 314581102
- config_name: it
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 241646764
num_examples: 50000
download_size: 91274760
dataset_size: 241646764
- config_name: iw
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 437877419
num_examples: 50000
download_size: 144951072
dataset_size: 437877419
- config_name: ja
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 713024245
num_examples: 50000
download_size: 245407179
dataset_size: 713024245
- config_name: ja-Latn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 128539736
num_examples: 50000
download_size: 45756043
dataset_size: 128539736
- config_name: jv
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 113173066
num_examples: 50000
download_size: 39232299
dataset_size: 113173066
- config_name: ka
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 511786383
num_examples: 50000
download_size: 136913895
dataset_size: 511786383
- config_name: kk
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 543789684
num_examples: 50000
download_size: 168278412
dataset_size: 543789684
- config_name: km
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 475029500
num_examples: 50000
download_size: 125467614
dataset_size: 475029500
- config_name: kn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 503370923
num_examples: 50000
download_size: 140449132
dataset_size: 503370923
- config_name: ko
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 478071878
num_examples: 50000
download_size: 164515369
dataset_size: 478071878
- config_name: ku
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 322952022
num_examples: 50000
download_size: 113967980
dataset_size: 322952022
- config_name: ky
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 382295219
num_examples: 50000
download_size: 124588930
dataset_size: 382295219
- config_name: la
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 201490949
num_examples: 50000
download_size: 70954840
dataset_size: 201490949
- config_name: lb
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 86013453
num_examples: 50000
download_size: 28208426
dataset_size: 86013453
- config_name: lo
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 487578306
num_examples: 50000
download_size: 133742177
dataset_size: 487578306
- config_name: lt
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 277052530
num_examples: 50000
download_size: 105302920
dataset_size: 277052530
- config_name: lv
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 292700812
num_examples: 50000
download_size: 108434503
dataset_size: 292700812
- config_name: mg
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 153647225
num_examples: 50000
download_size: 54675034
dataset_size: 153647225
- config_name: mi
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 274175282
num_examples: 50000
download_size: 96384965
dataset_size: 274175282
- config_name: mk
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 343480782
num_examples: 50000
download_size: 108720473
dataset_size: 343480782
- config_name: ml
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 518336007
num_examples: 50000
download_size: 139092161
dataset_size: 518336007
- config_name: mn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 449863510
num_examples: 50000
download_size: 148835026
dataset_size: 449863510
- config_name: mr
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 725826566
num_examples: 50000
download_size: 238855797
dataset_size: 725826566
- config_name: ms
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 288037901
num_examples: 50000
download_size: 106743822
dataset_size: 288037901
- config_name: mt
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 430323312
num_examples: 50000
download_size: 144741456
dataset_size: 430323312
- config_name: my
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 652055128
num_examples: 50000
download_size: 167829340
dataset_size: 652055128
- config_name: ne
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 429609574
num_examples: 50000
download_size: 124772931
dataset_size: 429609574
- config_name: nl
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 220359378
num_examples: 50000
download_size: 81417335
dataset_size: 220359378
- config_name: "no"
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 300708727
num_examples: 50000
download_size: 111677166
dataset_size: 300708727
- config_name: ny
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 182554696
num_examples: 50000
download_size: 66196926
dataset_size: 182554696
- config_name: pa
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 510193732
num_examples: 50000
download_size: 143408374
dataset_size: 510193732
- config_name: pl
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 271771724
num_examples: 50000
download_size: 104557441
dataset_size: 271771724
- config_name: ps
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 345174516
num_examples: 50000
download_size: 111502551
dataset_size: 345174516
- config_name: pt
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 234257106
num_examples: 50000
download_size: 87647760
dataset_size: 234257106
- config_name: ro
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 321857111
num_examples: 50000
download_size: 121342914
dataset_size: 321857111
- config_name: ru
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 402382110
num_examples: 50000
download_size: 130895100
dataset_size: 402382110
- config_name: ru-Latn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 274419616
num_examples: 50000
download_size: 87413208
dataset_size: 274419616
- config_name: sd
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 536562558
num_examples: 50000
download_size: 159870791
dataset_size: 536562558
- config_name: si
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 674982307
num_examples: 50000
download_size: 194906837
dataset_size: 674982307
- config_name: sk
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 264951862
num_examples: 50000
download_size: 102552156
dataset_size: 264951862
- config_name: sl
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 273670472
num_examples: 50000
download_size: 105822363
dataset_size: 273670472
- config_name: sm
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 202645025
num_examples: 50000
download_size: 69950533
dataset_size: 202645025
- config_name: sn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 135234592
num_examples: 50000
download_size: 50926433
dataset_size: 135234592
- config_name: so
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 395318975
num_examples: 50000
download_size: 133125283
dataset_size: 395318975
- config_name: sq
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 249195815
num_examples: 50000
download_size: 93056813
dataset_size: 249195815
- config_name: sr
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 483971141
num_examples: 50000
download_size: 159169044
dataset_size: 483971141
- config_name: st
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 285696617
num_examples: 50000
download_size: 102560831
dataset_size: 285696617
- config_name: su
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 121426551
num_examples: 50000
download_size: 43406189
dataset_size: 121426551
- config_name: sv
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 274579596
num_examples: 50000
download_size: 101662609
dataset_size: 274579596
- config_name: sw
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 262885836
num_examples: 50000
download_size: 96966447
dataset_size: 262885836
- config_name: ta
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 590015380
num_examples: 50000
download_size: 156103442
dataset_size: 590015380
- config_name: te
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 503110175
num_examples: 50000
download_size: 141360798
dataset_size: 503110175
- config_name: tg
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 427570044
num_examples: 50000
download_size: 134398687
dataset_size: 427570044
- config_name: th
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 397905102
num_examples: 50000
download_size: 108046100
dataset_size: 397905102
- config_name: tr
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 241247967
num_examples: 50000
download_size: 89078724
dataset_size: 241247967
- config_name: uk
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 433152355
num_examples: 50000
download_size: 140279932
dataset_size: 433152355
- config_name: und
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 96486081
num_examples: 50000
download_size: 34143693
dataset_size: 96486081
- config_name: ur
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 410698577
num_examples: 50000
download_size: 132678875
dataset_size: 410698577
- config_name: uz
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 305775337
num_examples: 50000
download_size: 111000865
dataset_size: 305775337
- config_name: vi
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 307230640
num_examples: 50000
download_size: 103988847
dataset_size: 307230640
- config_name: xh
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 221082900
num_examples: 50000
download_size: 82079831
dataset_size: 221082900
- config_name: yi
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 537757264
num_examples: 50000
download_size: 164598999
dataset_size: 537757264
- config_name: yo
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 239404099
num_examples: 46214
download_size: 83939345
dataset_size: 239404099
- config_name: zh
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 225685625
num_examples: 50000
download_size: 83255576
dataset_size: 225685625
- config_name: zh-Latn
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 135963615
num_examples: 50000
download_size: 48648499
dataset_size: 135963615
- config_name: zu
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 106940541
num_examples: 50000
download_size: 39213844
dataset_size: 106940541
configs:
- config_name: af
data_files:
- split: train
path: af/train-*
- config_name: am
data_files:
- split: train
path: am/train-*
- config_name: ar
data_files:
- split: train
path: ar/train-*
- config_name: az
data_files:
- split: train
path: az/train-*
- config_name: be
data_files:
- split: train
path: be/train-*
- config_name: bg
data_files:
- split: train
path: bg/train-*
- config_name: bg-Latn
data_files:
- split: train
path: bg-Latn/train-*
- config_name: bn
data_files:
- split: train
path: bn/train-*
- config_name: ca
data_files:
- split: train
path: ca/train-*
- config_name: ceb
data_files:
- split: train
path: ceb/train-*
- config_name: co
data_files:
- split: train
path: co/train-*
- config_name: cs
data_files:
- split: train
path: cs/train-*
- config_name: cy
data_files:
- split: train
path: cy/train-*
- config_name: da
data_files:
- split: train
path: da/train-*
- config_name: de
data_files:
- split: train
path: de/train-*
- config_name: el
data_files:
- split: train
path: el/train-*
- config_name: el-Latn
data_files:
- split: train
path: el-Latn/train-*
- config_name: en
data_files:
- split: train
path: en/train-*
- config_name: eo
data_files:
- split: train
path: eo/train-*
- config_name: es
data_files:
- split: train
path: es/train-*
- config_name: et
data_files:
- split: train
path: et/train-*
- config_name: eu
data_files:
- split: train
path: eu/train-*
- config_name: fa
data_files:
- split: train
path: fa/train-*
- config_name: fi
data_files:
- split: train
path: fi/train-*
- config_name: fil
data_files:
- split: train
path: fil/train-*
- config_name: fr
data_files:
- split: train
path: fr/train-*
- config_name: fy
data_files:
- split: train
path: fy/train-*
- config_name: ga
data_files:
- split: train
path: ga/train-*
- config_name: gd
data_files:
- split: train
path: gd/train-*
- config_name: gl
data_files:
- split: train
path: gl/train-*
- config_name: gu
data_files:
- split: train
path: gu/train-*
- config_name: ha
data_files:
- split: train
path: ha/train-*
- config_name: haw
data_files:
- split: train
path: haw/train-*
- config_name: hi
data_files:
- split: train
path: hi/train-*
- config_name: hi-Latn
data_files:
- split: train
path: hi-Latn/train-*
- config_name: hmn
data_files:
- split: train
path: hmn/train-*
- config_name: ht
data_files:
- split: train
path: ht/train-*
- config_name: hu
data_files:
- split: train
path: hu/train-*
- config_name: hy
data_files:
- split: train
path: hy/train-*
- config_name: id
data_files:
- split: train
path: id/train-*
- config_name: ig
data_files:
- split: train
path: ig/train-*
- config_name: is
data_files:
- split: train
path: is/train-*
- config_name: it
data_files:
- split: train
path: it/train-*
- config_name: iw
data_files:
- split: train
path: iw/train-*
- config_name: ja
data_files:
- split: train
path: ja/train-*
- config_name: ja-Latn
data_files:
- split: train
path: ja-Latn/train-*
- config_name: jv
data_files:
- split: train
path: jv/train-*
- config_name: ka
data_files:
- split: train
path: ka/train-*
- config_name: kk
data_files:
- split: train
path: kk/train-*
- config_name: km
data_files:
- split: train
path: km/train-*
- config_name: kn
data_files:
- split: train
path: kn/train-*
- config_name: ko
data_files:
- split: train
path: ko/train-*
- config_name: ku
data_files:
- split: train
path: ku/train-*
- config_name: ky
data_files:
- split: train
path: ky/train-*
- config_name: la
data_files:
- split: train
path: la/train-*
- config_name: lb
data_files:
- split: train
path: lb/train-*
- config_name: lo
data_files:
- split: train
path: lo/train-*
- config_name: lt
data_files:
- split: train
path: lt/train-*
- config_name: lv
data_files:
- split: train
path: lv/train-*
- config_name: mg
data_files:
- split: train
path: mg/train-*
- config_name: mi
data_files:
- split: train
path: mi/train-*
- config_name: mk
data_files:
- split: train
path: mk/train-*
- config_name: ml
data_files:
- split: train
path: ml/train-*
- config_name: mn
data_files:
- split: train
path: mn/train-*
- config_name: mr
data_files:
- split: train
path: mr/train-*
- config_name: ms
data_files:
- split: train
path: ms/train-*
- config_name: mt
data_files:
- split: train
path: mt/train-*
- config_name: my
data_files:
- split: train
path: my/train-*
- config_name: ne
data_files:
- split: train
path: ne/train-*
- config_name: nl
data_files:
- split: train
path: nl/train-*
- config_name: "no"
data_files:
- split: train
path: no/train-*
- config_name: ny
data_files:
- split: train
path: ny/train-*
- config_name: pa
data_files:
- split: train
path: pa/train-*
- config_name: pl
data_files:
- split: train
path: pl/train-*
- config_name: ps
data_files:
- split: train
path: ps/train-*
- config_name: pt
default: true
data_files:
- split: train
path: pt/train-*
- config_name: ro
data_files:
- split: train
path: ro/train-*
- config_name: ru
data_files:
- split: train
path: ru/train-*
- config_name: ru-Latn
data_files:
- split: train
path: ru-Latn/train-*
- config_name: sd
data_files:
- split: train
path: sd/train-*
- config_name: si
data_files:
- split: train
path: si/train-*
- config_name: sk
data_files:
- split: train
path: sk/train-*
- config_name: sl
data_files:
- split: train
path: sl/train-*
- config_name: sm
data_files:
- split: train
path: sm/train-*
- config_name: sn
data_files:
- split: train
path: sn/train-*
- config_name: so
data_files:
- split: train
path: so/train-*
- config_name: sq
data_files:
- split: train
path: sq/train-*
- config_name: sr
data_files:
- split: train
path: sr/train-*
- config_name: st
data_files:
- split: train
path: st/train-*
- config_name: su
data_files:
- split: train
path: su/train-*
- config_name: sv
data_files:
- split: train
path: sv/train-*
- config_name: sw
data_files:
- split: train
path: sw/train-*
- config_name: ta
data_files:
- split: train
path: ta/train-*
- config_name: te
data_files:
- split: train
path: te/train-*
- config_name: tg
data_files:
- split: train
path: tg/train-*
- config_name: th
data_files:
- split: train
path: th/train-*
- config_name: tr
data_files:
- split: train
path: tr/train-*
- config_name: uk
data_files:
- split: train
path: uk/train-*
- config_name: und
data_files:
- split: train
path: und/train-*
- config_name: ur
data_files:
- split: train
path: ur/train-*
- config_name: uz
data_files:
- split: train
path: uz/train-*
- config_name: vi
data_files:
- split: train
path: vi/train-*
- config_name: xh
data_files:
- split: train
path: xh/train-*
- config_name: yi
data_files:
- split: train
path: yi/train-*
- config_name: yo
data_files:
- split: train
path: yo/train-*
- config_name: zh
data_files:
- split: train
path: zh/train-*
- config_name: zh-Latn
data_files:
- split: train
path: zh-Latn/train-*
- config_name: zu
data_files:
- split: train
path: zu/train-*
---
# mC4 (50K samples per language)
## Dataset Description
- **Paper:** https://arxiv.org/abs/1910.10683
### Dataset Summary
**This is a subsample of mC4, the multilingual version of the C4 dataset. We keep up to 50,000 samples for each of the 108 language configs available (the `yo` config has 46,214). This is intended for educational and experimentation purposes. For the full dataset, please refer to the `multilingual` config of the `allenai/c4` dataset.**
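The per-config figures in the card metadata give a rough sense of how large each subsample is. The sketch below copies four entries from the `dataset_info` block and derives the average record size and the ratio of `download_size` to `num_bytes`. The helper name `summarize` is hypothetical, and reading `download_size` as the size of the stored (compressed) files is an assumption.

```python
# Figures copied from the `dataset_info` block of this card (train split only).
CONFIGS = {
    "en": {"num_bytes": 173_659_364, "download_size": 66_082_676, "num_examples": 50_000},
    "pt": {"num_bytes": 234_257_106, "download_size": 87_647_760, "num_examples": 50_000},
    "ja": {"num_bytes": 713_024_245, "download_size": 245_407_179, "num_examples": 50_000},
    "yo": {"num_bytes": 239_404_099, "download_size": 83_939_345, "num_examples": 46_214},
}


def summarize(info):
    """Return (average bytes per example, download_size / num_bytes)."""
    avg_bytes = info["num_bytes"] / info["num_examples"]
    ratio = info["download_size"] / info["num_bytes"]
    return round(avg_bytes), round(ratio, 2)


for name, info in CONFIGS.items():
    avg_bytes, ratio = summarize(info)
    print(f"{name}: {info['num_examples']} examples, "
          f"~{avg_bytes} bytes/example, download/uncompressed = {ratio}")

# Configs that fall short of the 50,000-sample target.
short = [name for name, info in CONFIGS.items() if info["num_examples"] < 50_000]
print("below 50,000 examples:", short)
```

Among the configs listed in the metadata, only `yo` reports fewer than 50,000 examples.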
#### How do I download this?
##### Using 🤗 Datasets
```python
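from datasets import load_dataset

# Portuguese only
pt = load_dataset("Polygl0t/mc4-50k", "pt")

# No config name: loads the default config, which the card metadata sets to `pt`
# (this call does not load all 108 configs)
mc4 = load_dataset("Polygl0t/mc4-50k")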
```
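Because the card metadata marks `pt` as the default config, omitting the config name is expected to load only that config. One way to load several (or all) configs is to loop over their names, as in the minimal sketch below. `load_configs` and its `loader` parameter are hypothetical helpers introduced here, not part of 🤗 Datasets. The commented usage assumes `get_dataset_config_names` is available in your installed `datasets` version.

```python
from typing import Any, Callable, Dict, Iterable, Optional

REPO_ID = "Polygl0t/mc4-50k"


def load_configs(
    config_names: Iterable[str],
    loader: Optional[Callable[..., Any]] = None,
    split: str = "train",
) -> Dict[str, Any]:
    """Load one split per config and return a {config_name: dataset} mapping.

    `loader` defaults to `datasets.load_dataset`; it is injectable so the
    helper can be exercised without network access.
    """
    if loader is None:
        # Imported lazily so that defining this helper does not require the library.
        from datasets import load_dataset as loader
    return {name: loader(REPO_ID, name, split=split) for name in config_names}


# Example usage (downloads data, so it is left commented out):
# from datasets import get_dataset_config_names
# all_languages = load_configs(get_dataset_config_names(REPO_ID))
# print(len(all_languages))
```

Loading every config downloads a large amount of data, so passing a short list of language codes is usually a better starting point.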
You can also load and mix multiple languages:
```python
from datasets import concatenate_datasets, interleave_datasets, load_dataset

# Pass split="train" so each call returns a single iterable dataset rather than a dict of splits
es = load_dataset("Polygl0t/mc4-50k", "es", split="train", streaming=True)
fr = load_dataset("Polygl0t/mc4-50k", "fr", split="train", streaming=True)

# Concatenate both datasets
concatenated = concatenate_datasets([es, fr])

# Or interleave them (alternates between one and the other)
interleaved = interleave_datasets([es, fr])
```
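To make the difference in ordering concrete, here is a small pure-Python illustration on toy lists. The functions `concatenate` and `interleave` are stand-ins written for this example, not the 🤗 Datasets implementations. With sources of unequal length, the library's behavior also depends on its stopping strategy, which this sketch does not model.

```python
from itertools import chain
from typing import Iterable, Iterator, List


def concatenate(streams: List[Iterable[str]]) -> Iterator[str]:
    """Yield every item of the first stream, then every item of the next."""
    return chain.from_iterable(streams)


def interleave(streams: List[Iterable[str]]) -> Iterator[str]:
    """Take one item from each stream in turn; stop when any stream runs out."""
    iterators = [iter(stream) for stream in streams]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


# Toy stand-ins for the `es` and `fr` streams.
es_toy = ["es-0", "es-1"]
fr_toy = ["fr-0", "fr-1"]

print(list(concatenate([es_toy, fr_toy])))  # ['es-0', 'es-1', 'fr-0', 'fr-1']
print(list(interleave([es_toy, fr_toy])))   # ['es-0', 'fr-0', 'es-1', 'fr-1']
```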
#### Languages
Note that language codes ending in "-Latn" are romanized variants, i.e. the language written in the Latin script.
| language code | language name |
| :------------ | :------------------- |
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bg-Latn | Bulgarian (Latin) |
| bn | Bangla |
| ca | Catalan |
| ceb | Cebuano |
| co | Corsican |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| el-Latn | Greek (Latin) |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fil | Filipino |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| haw | Hawaiian |
| hi | Hindi |
| hi-Latn       | Hindi (Latin)        |
| hmn | Hmong, Mong |
| ht | Haitian |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| iw            | Hebrew (legacy code) |
| ja | Japanese |
| ja-Latn | Japanese (Latin) |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kyrgyz |
| la | Latin |
| lb | Luxembourgish |
| lo | Lao |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mi | Maori |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| ny | Nyanja |
| pa | Punjabi |
| pl | Polish |
| ps | Pashto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| ru-Latn | Russian (Latin) |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sm | Samoan |
| sn | Shona |
| so | Somali |
| sq | Albanian |
| sr | Serbian |
| st | Southern Sotho |
| su | Sundanese |
| sv | Swedish |
| sw | Swahili |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| und | Unknown language |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zh-Latn | Chinese (Latin) |
| zu | Zulu |
## Dataset Structure
### Data Instances
An example from the `en` config is:
```
{
'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'
}
```
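Each record is a plain dictionary, so it can be inspected with ordinary string operations. The sketch below uses a shortened excerpt of the example above and a hypothetical helper, `describe`, to split the text on the newline separators and measure its size.

```python
record = {
    # Shortened excerpt of the `en` example above, for illustration only.
    "text": (
        "Beginners BBQ Class Taking Place in Missoula!\n"
        "Do you want to get better at making delicious BBQ?"
    )
}


def describe(record):
    """Return simple statistics for one record's `text` field."""
    text = record["text"]
    lines = text.split("\n")  # lines within a document are separated by "\n"
    return {
        "num_lines": len(lines),
        "num_chars": len(text),
        "num_utf8_bytes": len(text.encode("utf-8")),
        "first_line": lines[0],
    }


stats = describe(record)
print(stats["first_line"])  # Beginners BBQ Class Taking Place in Missoula!
print(stats["num_lines"])   # 2
```

For non-Latin scripts, `num_utf8_bytes` will generally exceed `num_chars`, which is one reason byte sizes differ so much across configs with the same number of examples.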
### Data Fields
Each record has a single field:
- `text`: text content as a string
### Source Data
#### Initial Data Collection and Normalization
The C4 and mC4 datasets are collections of text sourced from the public Common Crawl web scrape. Their construction applies heuristics to extract only natural language (as opposed to boilerplate and other gibberish), in addition to extensive deduplication.
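As a rough illustration of what deduplication can look like, the sketch below drops any line that has already appeared in an earlier document. This is a simplified, hypothetical example written for this card. It is not the actual C4 or mC4 pipeline, and the function name `dedupe_lines` and the normalization it applies are assumptions.

```python
import hashlib
from typing import List


def dedupe_lines(documents: List[str]) -> List[str]:
    """Remove lines whose normalized form was already seen in any earlier line."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for line in doc.split("\n"):
            # Normalize lightly, then hash so the set stores fixed-size keys.
            key = hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            kept.append(line)
        if kept:  # drop documents that end up empty
            result.append("\n".join(kept))
    return result


docs = [
    "Home | About | Contact\nFirst article body.",
    "Home | About | Contact\nSecond article body.",
]
cleaned = dedupe_lines(docs)
print(cleaned)
# ['Home | About | Contact\nFirst article body.', 'Second article body.']
```

The repeated navigation line survives only in the first document, which is the kind of boilerplate such a pass is meant to reduce.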
### Licensing Information
We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
Provided by:
Polygl0t



