deepdml/cv22-neucodec
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/deepdml/cv22-neucodec
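The following is a minimal sketch of loading one language configuration. It assumes the Hugging Face `datasets` library is installed and that the repository ID matches the page title. The helper names `split_patterns` and `load_split` are hypothetical; the config names, split names, and file patterns come from the metadata listed on this page. Not every config has all four splits: `zu` lists no `validation` split, and `myv`, `sq`, and `tg` list no `other` split.

```python
REPO_ID = "deepdml/cv22-neucodec"  # assumption: repository ID taken from the page title
SPLITS = ("train", "validation", "test", "other")


def split_patterns(config_name, splits=SPLITS):
    """Return the per-split file patterns in the form used by the card's `configs` section."""
    return {split: f"{config_name}/{split}-*" for split in splits}


def load_split(config_name, split="train"):
    """Download one split of one language config (requires network access)."""
    # Imported lazily so that split_patterns() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


if __name__ == "__main__":
    # Offline part only: show the file patterns for the `af` config.
    print(split_patterns("af"))
```

For example, `load_split("af", "train")` would be expected to return the 139-example training split that the metadata lists for the `af` config, with the columns `audio_path`, `duration`, `codes`, `sentence`, and `client_id`.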
Resource description:
---
dataset_info:
- config_name: af
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 234215
num_examples: 139
- name: validation
num_bytes: 210750
num_examples: 125
- name: test
num_bytes: 211704
num_examples: 131
- name: other
num_bytes: 530539
num_examples: 306
download_size: 2916207
dataset_size: 1187208
- config_name: am
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 862126
num_examples: 523
- name: validation
num_bytes: 388373
num_examples: 248
- name: test
num_bytes: 422647
num_examples: 252
- name: other
num_bytes: 1036563
num_examples: 579
download_size: 4110632
dataset_size: 2709709
- config_name: ar
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 34082906
num_examples: 28531
- name: validation
num_bytes: 13185327
num_examples: 10503
- name: test
num_bytes: 12850962
num_examples: 10500
- name: other
num_bytes: 49125511
num_examples: 41364
download_size: 121907594
dataset_size: 109244706
- config_name: as
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1496770
num_examples: 952
- name: validation
num_bytes: 710085
num_examples: 485
- name: test
num_bytes: 596191
num_examples: 379
- name: other
num_bytes: 4192940
num_examples: 2557
download_size: 7931300
dataset_size: 6995986
- config_name: az
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 221414
num_examples: 157
- name: validation
num_bytes: 100869
num_examples: 78
- name: test
num_bytes: 150855
num_examples: 95
- name: other
num_bytes: 774746
num_examples: 529
download_size: 1825196
dataset_size: 1247884
- config_name: be
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 479701676
num_examples: 347672
- name: validation
num_bytes: 24574704
num_examples: 15879
- name: test
num_bytes: 24836588
num_examples: 15880
- name: other
num_bytes: 26024040
num_examples: 17002
download_size: 311423647
dataset_size: 555137008
- config_name: bg
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 7168900
num_examples: 4952
- name: validation
num_bytes: 4457729
num_examples: 2932
- name: test
num_bytes: 5084572
num_examples: 3354
- name: other
num_bytes: 2727148
num_examples: 1787
download_size: 10882523
dataset_size: 19438349
- config_name: bn
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 34739997
num_examples: 21514
- name: validation
num_bytes: 16186658
num_examples: 9382
- name: test
num_bytes: 16428595
num_examples: 9382
- name: other
num_bytes: 1303726999
num_examples: 999246
download_size: 741753000
dataset_size: 1371082249
- config_name: ca
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1746515510
num_examples: 1208213
- name: validation
num_bytes: 25973496
num_examples: 16414
- name: test
num_bytes: 25874994
num_examples: 16414
- name: other
num_bytes: 273541359
num_examples: 223303
download_size: 1161634677
dataset_size: 2071905359
- config_name: cs
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 28603672
num_examples: 21731
- name: validation
num_bytes: 11983611
num_examples: 9410
- name: test
num_bytes: 12023840
num_examples: 9421
- name: other
num_bytes: 190308760
num_examples: 149113
download_size: 136434186
dataset_size: 242919883
- config_name: cy
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 11146324
num_examples: 8014
- name: validation
num_bytes: 7901797
num_examples: 5408
- name: test
num_bytes: 7896561
num_examples: 5408
- name: other
num_bytes: 29824233
num_examples: 20676
download_size: 32546870
dataset_size: 56768915
- config_name: da
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 4282783
num_examples: 3602
- name: validation
num_bytes: 3388363
num_examples: 2630
- name: test
num_bytes: 3537954
num_examples: 2758
- name: other
num_bytes: 2558218
num_examples: 2215
download_size: 7839424
dataset_size: 13767318
- config_name: el
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2287740
num_examples: 1934
- name: validation
num_bytes: 2001629
num_examples: 1694
- name: test
num_bytes: 2110810
num_examples: 1711
- name: other
num_bytes: 12622896
num_examples: 10351
download_size: 10675779
dataset_size: 19023075
- config_name: en
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1713024261
num_examples: 1138760
- name: validation
num_bytes: 25491746
num_examples: 16400
- name: test
num_bytes: 25384592
num_examples: 16400
- name: other
num_bytes: 537774106
num_examples: 370671
download_size: 1313672547
dataset_size: 2301674705
- config_name: es
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 495961511
num_examples: 353701
- name: validation
num_bytes: 25061639
num_examples: 15893
- name: test
num_bytes: 25222726
num_examples: 15893
- name: other
num_bytes: 1513520844
num_examples: 1142320
download_size: 1161449826
dataset_size: 2059766720
- config_name: et
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 5975326
num_examples: 3402
- name: validation
num_bytes: 4847996
num_examples: 2823
- name: test
num_bytes: 4894333
num_examples: 2823
- name: other
num_bytes: 163980
num_examples: 107
download_size: 9461859
dataset_size: 15881635
- config_name: eu
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 195688979
num_examples: 130043
- name: validation
num_bytes: 23251143
num_examples: 14753
- name: test
num_bytes: 23434069
num_examples: 14753
- name: other
num_bytes: 173505583
num_examples: 115423
download_size: 238311864
dataset_size: 415879774
- config_name: fa
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 33310522
num_examples: 29789
- name: validation
num_bytes: 12875044
num_examples: 10676
- name: test
num_bytes: 14427351
num_examples: 10676
- name: other
num_bytes: 37098112
num_examples: 34503
download_size: 54844523
dataset_size: 97711029
- config_name: fi
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2598672
num_examples: 2093
- name: validation
num_bytes: 2249772
num_examples: 1767
- name: test
num_bytes: 2487415
num_examples: 1806
- name: other
num_bytes: 6610991
num_examples: 5078
download_size: 7995992
dataset_size: 13946850
- config_name: fr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 831402083
num_examples: 593066
- name: validation
num_bytes: 24707701
num_examples: 16186
- name: test
num_bytes: 24865963
num_examples: 16186
- name: other
num_bytes: 27765612
num_examples: 18829
download_size: 515858563
dataset_size: 908741359
- config_name: gl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 94575749
num_examples: 70039
- name: validation
num_bytes: 18366955
num_examples: 13443
- name: test
num_bytes: 19035550
num_examples: 13443
- name: other
num_bytes: 206135327
num_examples: 153838
download_size: 192249088
dataset_size: 338113581
- config_name: ha
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2287572
num_examples: 1908
- name: validation
num_bytes: 749972
num_examples: 623
- name: test
num_bytes: 960260
num_examples: 750
- name: other
num_bytes: 8102486
num_examples: 6668
download_size: 6848866
dataset_size: 12100290
- config_name: he
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1254839
num_examples: 1011
- name: validation
num_bytes: 894902
num_examples: 672
- name: test
num_bytes: 569890
num_examples: 392
- name: other
num_bytes: 3191039
num_examples: 2472
download_size: 3426207
dataset_size: 5910670
- config_name: hi
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 6261214
num_examples: 4869
- name: validation
num_bytes: 3687146
num_examples: 2700
- name: test
num_bytes: 4794833
num_examples: 3343
- name: other
num_bytes: 7121601
num_examples: 4449
download_size: 12271651
dataset_size: 21864794
- config_name: hu
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 55012325
num_examples: 39270
- name: validation
num_bytes: 16466797
num_examples: 11604
- name: test
num_bytes: 17257348
num_examples: 11659
- name: other
num_bytes: 77488850
num_examples: 50475
download_size: 94092317
dataset_size: 166225320
- config_name: hy-AM
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 13638764
num_examples: 9303
- name: validation
num_bytes: 8758504
num_examples: 5859
- name: test
num_bytes: 9039088
num_examples: 5823
- name: other
num_bytes: 22530952
num_examples: 15157
download_size: 30286498
dataset_size: 53967308
- config_name: ig
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 13084
num_examples: 9
- name: validation
num_bytes: 3940
num_examples: 3
- name: test
num_bytes: 7050
num_examples: 5
- name: other
num_bytes: 8379738
num_examples: 5784
download_size: 4800403
dataset_size: 8403812
- config_name: is
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 30244
num_examples: 17
- name: validation
num_bytes: 15662
num_examples: 9
- name: test
num_bytes: 16554
num_examples: 9
- name: other
num_bytes: 45828
num_examples: 25
download_size: 124588
dataset_size: 108288
- config_name: it
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 247018863
num_examples: 172828
- name: validation
num_bytes: 23842217
num_examples: 15179
- name: test
num_bytes: 24633330
num_examples: 15177
- name: other
num_bytes: 27060725
num_examples: 17384
download_size: 183594875
dataset_size: 322555135
- config_name: ja
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 20206831
num_examples: 15425
- name: validation
num_bytes: 10081711
num_examples: 8004
- name: test
num_bytes: 9929705
num_examples: 8004
- name: other
num_bytes: 326983657
num_examples: 263563
download_size: 204128063
dataset_size: 367201904
- config_name: ka
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 96846440
num_examples: 62537
- name: validation
num_bytes: 21317753
num_examples: 12952
- name: test
num_bytes: 21620444
num_examples: 13104
- name: other
num_bytes: 149411577
num_examples: 97022
download_size: 156957082
dataset_size: 289196214
- config_name: kk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 814528
num_examples: 605
- name: validation
num_bytes: 675867
num_examples: 513
- name: test
num_bytes: 743913
num_examples: 536
- name: other
num_bytes: 1010553
num_examples: 730
download_size: 2037985
dataset_size: 3244861
- config_name: kmr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 6008264
num_examples: 5277
- name: validation
num_bytes: 4741059
num_examples: 3999
- name: test
num_bytes: 5067586
num_examples: 3991
- name: other
num_bytes: 28620126
num_examples: 25416
download_size: 24337124
dataset_size: 44437035
- config_name: ko
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 851454
num_examples: 519
- name: validation
num_bytes: 673691
num_examples: 474
- name: test
num_bytes: 657580
num_examples: 472
- name: other
num_bytes: 5261817
num_examples: 3813
download_size: 4461063
dataset_size: 7444542
- config_name: lo
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 171277
num_examples: 98
- name: validation
num_bytes: 51859
num_examples: 28
- name: test
num_bytes: 48073
num_examples: 26
- name: other
num_bytes: 108928
num_examples: 61
download_size: 347529
dataset_size: 380137
- config_name: lt
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 11440900
num_examples: 8299
- name: validation
num_bytes: 6918684
num_examples: 5111
- name: test
num_bytes: 7556560
num_examples: 5384
- name: other
num_bytes: 3825069
num_examples: 2735
download_size: 17030066
dataset_size: 29741213
- config_name: lv
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 22032356
num_examples: 14354
- name: validation
num_bytes: 11630174
num_examples: 7705
- name: test
num_bytes: 11611189
num_examples: 7705
- name: other
num_bytes: 34299878
num_examples: 21114
download_size: 46402536
dataset_size: 79573597
- config_name: mk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2591206
num_examples: 2049
- name: validation
num_bytes: 2491435
num_examples: 1776
- name: test
num_bytes: 2532501
num_examples: 1754
- name: other
num_bytes: 33517331
num_examples: 23863
download_size: 22934774
dataset_size: 41132473
- config_name: ml
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1510367
num_examples: 1235
- name: validation
num_bytes: 1118360
num_examples: 926
- name: test
num_bytes: 1108969
num_examples: 873
- name: other
num_bytes: 7644241
num_examples: 5968
download_size: 6176964
dataset_size: 11381937
- config_name: mn
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 3184227
num_examples: 2193
- name: validation
num_bytes: 2950138
num_examples: 1932
- name: test
num_bytes: 3024472
num_examples: 1933
- name: other
num_bytes: 83640126
num_examples: 59357
download_size: 51482622
dataset_size: 92798963
- config_name: mr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 3643591
num_examples: 2189
- name: validation
num_bytes: 3067611
num_examples: 1766
- name: test
num_bytes: 3082336
num_examples: 1796
- name: other
num_bytes: 4896263
num_examples: 2796
download_size: 8257336
dataset_size: 14689801
- config_name: mt
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2390444
num_examples: 1910
- name: validation
num_bytes: 2082493
num_examples: 1625
- name: test
num_bytes: 2279537
num_examples: 1660
- name: other
num_bytes: 8289266
num_examples: 6288
download_size: 8679859
dataset_size: 15041740
- config_name: myv
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1926707
num_examples: 1241
- name: validation
num_bytes: 373014
num_examples: 239
- name: test
num_bytes: 761448
num_examples: 481
download_size: 1899636
dataset_size: 3061169
- config_name: nb-NO
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 255522
num_examples: 227
- name: validation
num_bytes: 44283
num_examples: 33
- name: test
num_bytes: 146887
num_examples: 116
- name: other
num_bytes: 74705
num_examples: 59
download_size: 417787
dataset_size: 521397
- config_name: ne-NP
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 401280
num_examples: 353
- name: validation
num_bytes: 358889
num_examples: 314
- name: test
num_bytes: 372285
num_examples: 287
- name: other
num_bytes: 474354
num_examples: 362
download_size: 1107410
dataset_size: 1606808
- config_name: nl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 54867973
num_examples: 43458
- name: validation
num_bytes: 15893509
num_examples: 12032
- name: test
num_bytes: 16119447
num_examples: 12033
- name: other
num_bytes: 2937523
num_examples: 2396
download_size: 50465448
dataset_size: 89818452
- config_name: nn-NO
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 527165
num_examples: 464
- name: validation
num_bytes: 504554
num_examples: 405
- name: test
num_bytes: 535528
num_examples: 423
- name: other
num_bytes: 23761
num_examples: 18
download_size: 1058381
dataset_size: 1591008
- config_name: oc
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 404991
num_examples: 304
- name: validation
num_bytes: 344313
num_examples: 267
- name: test
num_bytes: 371751
num_examples: 274
- name: other
num_bytes: 10209903
num_examples: 7707
download_size: 6584702
dataset_size: 11330958
- config_name: pa-IN
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1165822
num_examples: 800
- name: validation
num_bytes: 512761
num_examples: 406
- name: test
num_bytes: 767600
num_examples: 587
- name: other
num_bytes: 1685312
num_examples: 1243
download_size: 2464166
dataset_size: 4131495
- config_name: pl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 34542240
num_examples: 24173
- name: validation
num_bytes: 13801272
num_examples: 9856
- name: test
num_bytes: 13632158
num_examples: 9856
- name: other
num_bytes: 3513600
num_examples: 2446
download_size: 37701918
dataset_size: 65489270
- config_name: ps
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 6368249
num_examples: 4611
- name: validation
num_bytes: 5031704
num_examples: 3610
- name: test
num_bytes: 5364908
num_examples: 3610
- name: other
num_bytes: 54739210
num_examples: 41323
download_size: 39548397
dataset_size: 71504071
- config_name: pt
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 26828540
num_examples: 22923
- name: validation
num_bytes: 11950658
num_examples: 9640
- name: test
num_bytes: 12587688
num_examples: 9641
- name: other
num_bytes: 34899657
num_examples: 27353
download_size: 49488787
dataset_size: 86266543
- config_name: ro
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 5953689
num_examples: 5178
- name: validation
num_bytes: 4417234
num_examples: 3918
- name: test
num_bytes: 4673895
num_examples: 3930
- name: other
num_bytes: 26879674
num_examples: 23002
download_size: 23634928
dataset_size: 41924492
- config_name: ru
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 38559961
num_examples: 26654
- name: validation
num_bytes: 15426522
num_examples: 10243
- name: test
num_bytes: 15737108
num_examples: 10244
- name: other
num_bytes: 26314800
num_examples: 17594
download_size: 54592228
dataset_size: 96038391
- config_name: sk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 9029212
num_examples: 7354
- name: validation
num_bytes: 5868546
num_examples: 5007
- name: test
num_bytes: 6062947
num_examples: 5053
- name: other
num_bytes: 482561
num_examples: 358
download_size: 12009303
dataset_size: 21443266
- config_name: sl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1575265
num_examples: 1469
- name: validation
num_bytes: 1541092
num_examples: 1331
- name: test
num_bytes: 1660418
num_examples: 1340
- name: other
num_bytes: 4208395
num_examples: 3409
download_size: 5549160
dataset_size: 8985170
- config_name: sq
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 3723175
num_examples: 2658
- name: validation
num_bytes: 2187629
num_examples: 1645
- name: test
num_bytes: 2622407
num_examples: 1917
download_size: 4896809
dataset_size: 8533211
- config_name: sr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2115757
num_examples: 2336
- name: validation
num_bytes: 1805602
num_examples: 1908
- name: test
num_bytes: 2036108
num_examples: 1977
- name: other
num_bytes: 5126546
num_examples: 4846
download_size: 5997628
dataset_size: 11084013
- config_name: sv-SE
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 9462303
num_examples: 8150
- name: validation
num_bytes: 6263828
num_examples: 5420
- name: test
num_bytes: 6872199
num_examples: 5441
- name: other
num_bytes: 7924219
num_examples: 6250
download_size: 17495947
dataset_size: 30522549
- config_name: sw
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 67157638
num_examples: 46534
- name: validation
num_bytes: 18174219
num_examples: 12253
- name: test
num_bytes: 18079470
num_examples: 12256
- name: other
num_bytes: 525610635
num_examples: 376852
download_size: 352593276
dataset_size: 629021962
- config_name: ta
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 82949308
num_examples: 46390
- name: validation
num_bytes: 19716888
num_examples: 12150
- name: test
num_bytes: 19586389
num_examples: 12237
- name: other
num_bytes: 181142817
num_examples: 105179
download_size: 163410751
dataset_size: 303395402
- config_name: te
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 87767
num_examples: 69
- name: validation
num_bytes: 83241
num_examples: 67
- name: test
num_bytes: 84150
num_examples: 66
- name: other
num_bytes: 2471552
num_examples: 2034
download_size: 1568346
dataset_size: 2726710
- config_name: tg
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 171905
num_examples: 123
- name: validation
num_bytes: 111107
num_examples: 90
- name: test
num_bytes: 90132
num_examples: 69
download_size: 315634
dataset_size: 373144
- config_name: th
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 40081786
num_examples: 32959
- name: validation
num_bytes: 14825192
num_examples: 11057
- name: test
num_bytes: 15258892
num_examples: 11057
- name: other
num_bytes: 252691132
num_examples: 208030
download_size: 175053949
dataset_size: 322857002
- config_name: tr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 45396387
num_examples: 40377
- name: validation
num_bytes: 12752317
num_examples: 11783
- name: test
num_bytes: 14250458
num_examples: 11784
- name: other
num_bytes: 145158
num_examples: 116
download_size: 40540706
dataset_size: 72544320
- config_name: uk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 33002305
num_examples: 26773
- name: validation
num_bytes: 13613732
num_examples: 10253
- name: test
num_bytes: 14140846
num_examples: 10259
- name: other
num_bytes: 10306630
num_examples: 8286
download_size: 40071670
dataset_size: 71063513
- config_name: ur
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 8800149
num_examples: 7326
- name: validation
num_bytes: 6399106
num_examples: 5082
- name: test
num_bytes: 6549143
num_examples: 5082
- name: other
num_bytes: 215492072
num_examples: 173382
download_size: 131213158
dataset_size: 237240470
- config_name: uz
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 56988342
num_examples: 48733
- name: validation
num_bytes: 15691563
num_examples: 12261
- name: test
num_bytes: 17432166
num_examples: 12365
- name: other
num_bytes: 147306240
num_examples: 128457
download_size: 132344997
dataset_size: 237418311
- config_name: vi
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2635793
num_examples: 2104
- name: validation
num_bytes: 976402
num_examples: 931
- name: test
num_bytes: 1511972
num_examples: 1367
- name: other
num_bytes: 14063781
num_examples: 12568
download_size: 10529841
dataset_size: 19187948
- config_name: yo
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2246430
num_examples: 1404
- name: validation
num_bytes: 1320273
num_examples: 913
- name: test
num_bytes: 1803934
num_examples: 1113
- name: other
num_bytes: 1919407
num_examples: 1156
download_size: 4315498
dataset_size: 7290044
- config_name: zgh
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 838586
num_examples: 842
- name: validation
num_bytes: 293213
num_examples: 297
- name: test
num_bytes: 269380
num_examples: 228
- name: other
num_bytes: 720789
num_examples: 648
download_size: 1247856
dataset_size: 2121968
- config_name: zu
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 16797
num_examples: 12
- name: test
num_bytes: 1685
num_examples: 1
- name: other
num_bytes: 107694
num_examples: 106
download_size: 112535
dataset_size: 126176
configs:
- config_name: af
data_files:
- split: train
path: af/train-*
- split: validation
path: af/validation-*
- split: test
path: af/test-*
- split: other
path: af/other-*
- config_name: am
data_files:
- split: train
path: am/train-*
- split: validation
path: am/validation-*
- split: test
path: am/test-*
- split: other
path: am/other-*
- config_name: ar
data_files:
- split: train
path: ar/train-*
- split: validation
path: ar/validation-*
- split: test
path: ar/test-*
- split: other
path: ar/other-*
- config_name: as
data_files:
- split: train
path: as/train-*
- split: validation
path: as/validation-*
- split: test
path: as/test-*
- split: other
path: as/other-*
- config_name: az
data_files:
- split: train
path: az/train-*
- split: validation
path: az/validation-*
- split: test
path: az/test-*
- split: other
path: az/other-*
- config_name: be
data_files:
- split: train
path: be/train-*
- split: validation
path: be/validation-*
- split: test
path: be/test-*
- split: other
path: be/other-*
- config_name: bg
data_files:
- split: train
path: bg/train-*
- split: validation
path: bg/validation-*
- split: test
path: bg/test-*
- split: other
path: bg/other-*
- config_name: bn
data_files:
- split: train
path: bn/train-*
- split: validation
path: bn/validation-*
- split: test
path: bn/test-*
- split: other
path: bn/other-*
- config_name: ca
data_files:
- split: train
path: ca/train-*
- split: validation
path: ca/validation-*
- split: test
path: ca/test-*
- split: other
path: ca/other-*
- config_name: cs
data_files:
- split: train
path: cs/train-*
- split: validation
path: cs/validation-*
- split: test
path: cs/test-*
- split: other
path: cs/other-*
- config_name: cy
data_files:
- split: train
path: cy/train-*
- split: validation
path: cy/validation-*
- split: test
path: cy/test-*
- split: other
path: cy/other-*
- config_name: da
data_files:
- split: train
path: da/train-*
- split: validation
path: da/validation-*
- split: test
path: da/test-*
- split: other
path: da/other-*
- config_name: el
data_files:
- split: train
path: el/train-*
- split: validation
path: el/validation-*
- split: test
path: el/test-*
- split: other
path: el/other-*
- config_name: en
data_files:
- split: train
path: en/train-*
- split: validation
path: en/validation-*
- split: test
path: en/test-*
- split: other
path: en/other-*
- config_name: es
data_files:
- split: train
path: es/train-*
- split: validation
path: es/validation-*
- split: test
path: es/test-*
- split: other
path: es/other-*
- config_name: et
data_files:
- split: train
path: et/train-*
- split: validation
path: et/validation-*
- split: test
path: et/test-*
- split: other
path: et/other-*
- config_name: eu
data_files:
- split: train
path: eu/train-*
- split: validation
path: eu/validation-*
- split: test
path: eu/test-*
- split: other
path: eu/other-*
- config_name: fa
data_files:
- split: train
path: fa/train-*
- split: validation
path: fa/validation-*
- split: test
path: fa/test-*
- split: other
path: fa/other-*
- config_name: fi
data_files:
- split: train
path: fi/train-*
- split: validation
path: fi/validation-*
- split: test
path: fi/test-*
- split: other
path: fi/other-*
- config_name: fr
data_files:
- split: train
path: fr/train-*
- split: validation
path: fr/validation-*
- split: test
path: fr/test-*
- split: other
path: fr/other-*
- config_name: gl
data_files:
- split: train
path: gl/train-*
- split: validation
path: gl/validation-*
- split: test
path: gl/test-*
- split: other
path: gl/other-*
- config_name: ha
data_files:
- split: train
path: ha/train-*
- split: validation
path: ha/validation-*
- split: test
path: ha/test-*
- split: other
path: ha/other-*
- config_name: he
data_files:
- split: train
path: he/train-*
- split: validation
path: he/validation-*
- split: test
path: he/test-*
- split: other
path: he/other-*
- config_name: hi
data_files:
- split: train
path: hi/train-*
- split: validation
path: hi/validation-*
- split: test
path: hi/test-*
- split: other
path: hi/other-*
- config_name: hu
data_files:
- split: train
path: hu/train-*
- split: validation
path: hu/validation-*
- split: test
path: hu/test-*
- split: other
path: hu/other-*
- config_name: hy-AM
data_files:
- split: train
path: hy-AM/train-*
- split: validation
path: hy-AM/validation-*
- split: test
path: hy-AM/test-*
- split: other
path: hy-AM/other-*
- config_name: ig
data_files:
- split: train
path: ig/train-*
- split: validation
path: ig/validation-*
- split: test
path: ig/test-*
- split: other
path: ig/other-*
- config_name: is
data_files:
- split: train
path: is/train-*
- split: validation
path: is/validation-*
- split: test
path: is/test-*
- split: other
path: is/other-*
- config_name: it
data_files:
- split: train
path: it/train-*
- split: validation
path: it/validation-*
- split: test
path: it/test-*
- split: other
path: it/other-*
- config_name: ja
data_files:
- split: train
path: ja/train-*
- split: validation
path: ja/validation-*
- split: test
path: ja/test-*
- split: other
path: ja/other-*
- config_name: ka
data_files:
- split: train
path: ka/train-*
- split: validation
path: ka/validation-*
- split: test
path: ka/test-*
- split: other
path: ka/other-*
- config_name: kk
data_files:
- split: train
path: kk/train-*
- split: validation
path: kk/validation-*
- split: test
path: kk/test-*
- split: other
path: kk/other-*
- config_name: kmr
data_files:
- split: train
path: kmr/train-*
- split: validation
path: kmr/validation-*
- split: test
path: kmr/test-*
- split: other
path: kmr/other-*
- config_name: ko
data_files:
- split: train
path: ko/train-*
- split: validation
path: ko/validation-*
- split: test
path: ko/test-*
- split: other
path: ko/other-*
- config_name: lo
data_files:
- split: train
path: lo/train-*
- split: validation
path: lo/validation-*
- split: test
path: lo/test-*
- split: other
path: lo/other-*
- config_name: lt
data_files:
- split: train
path: lt/train-*
- split: validation
path: lt/validation-*
- split: test
path: lt/test-*
- split: other
path: lt/other-*
- config_name: lv
data_files:
- split: train
path: lv/train-*
- split: validation
path: lv/validation-*
- split: test
path: lv/test-*
- split: other
path: lv/other-*
- config_name: mk
data_files:
- split: train
path: mk/train-*
- split: validation
path: mk/validation-*
- split: test
path: mk/test-*
- split: other
path: mk/other-*
- config_name: ml
data_files:
- split: train
path: ml/train-*
- split: validation
path: ml/validation-*
- split: test
path: ml/test-*
- split: other
path: ml/other-*
- config_name: mn
data_files:
- split: train
path: mn/train-*
- split: validation
path: mn/validation-*
- split: test
path: mn/test-*
- split: other
path: mn/other-*
- config_name: mr
data_files:
- split: train
path: mr/train-*
- split: validation
path: mr/validation-*
- split: test
path: mr/test-*
- split: other
path: mr/other-*
- config_name: mt
data_files:
- split: train
path: mt/train-*
- split: validation
path: mt/validation-*
- split: test
path: mt/test-*
- split: other
path: mt/other-*
- config_name: myv
data_files:
- split: train
path: myv/train-*
- split: validation
path: myv/validation-*
- split: test
path: myv/test-*
- config_name: nb-NO
data_files:
- split: train
path: nb-NO/train-*
- split: validation
path: nb-NO/validation-*
- split: test
path: nb-NO/test-*
- split: other
path: nb-NO/other-*
- config_name: ne-NP
data_files:
- split: train
path: ne-NP/train-*
- split: validation
path: ne-NP/validation-*
- split: test
path: ne-NP/test-*
- split: other
path: ne-NP/other-*
- config_name: nl
data_files:
- split: train
path: nl/train-*
- split: validation
path: nl/validation-*
- split: test
path: nl/test-*
- split: other
path: nl/other-*
- config_name: nn-NO
data_files:
- split: train
path: nn-NO/train-*
- split: validation
path: nn-NO/validation-*
- split: test
path: nn-NO/test-*
- split: other
path: nn-NO/other-*
- config_name: oc
data_files:
- split: train
path: oc/train-*
- split: validation
path: oc/validation-*
- split: test
path: oc/test-*
- split: other
path: oc/other-*
- config_name: pa-IN
data_files:
- split: train
path: pa-IN/train-*
- split: validation
path: pa-IN/validation-*
- split: test
path: pa-IN/test-*
- split: other
path: pa-IN/other-*
- config_name: pl
data_files:
- split: train
path: pl/train-*
- split: validation
path: pl/validation-*
- split: test
path: pl/test-*
- split: other
path: pl/other-*
- config_name: ps
data_files:
- split: train
path: ps/train-*
- split: validation
path: ps/validation-*
- split: test
path: ps/test-*
- split: other
path: ps/other-*
- config_name: pt
data_files:
- split: train
path: pt/train-*
- split: validation
path: pt/validation-*
- split: test
path: pt/test-*
- split: other
path: pt/other-*
- config_name: ro
data_files:
- split: train
path: ro/train-*
- split: validation
path: ro/validation-*
- split: test
path: ro/test-*
- split: other
path: ro/other-*
- config_name: ru
data_files:
- split: train
path: ru/train-*
- split: validation
path: ru/validation-*
- split: test
path: ru/test-*
- split: other
path: ru/other-*
- config_name: sk
data_files:
- split: train
path: sk/train-*
- split: validation
path: sk/validation-*
- split: test
path: sk/test-*
- split: other
path: sk/other-*
- config_name: sl
data_files:
- split: train
path: sl/train-*
- split: validation
path: sl/validation-*
- split: test
path: sl/test-*
- split: other
path: sl/other-*
- config_name: sq
data_files:
- split: train
path: sq/train-*
- split: validation
path: sq/validation-*
- split: test
path: sq/test-*
- config_name: sr
data_files:
- split: train
path: sr/train-*
- split: validation
path: sr/validation-*
- split: test
path: sr/test-*
- split: other
path: sr/other-*
- config_name: sv-SE
data_files:
- split: train
path: sv-SE/train-*
- split: validation
path: sv-SE/validation-*
- split: test
path: sv-SE/test-*
- split: other
path: sv-SE/other-*
- config_name: sw
data_files:
- split: train
path: sw/train-*
- split: validation
path: sw/validation-*
- split: test
path: sw/test-*
- split: other
path: sw/other-*
- config_name: ta
data_files:
- split: train
path: ta/train-*
- split: validation
path: ta/validation-*
- split: test
path: ta/test-*
- split: other
path: ta/other-*
- config_name: te
data_files:
- split: train
path: te/train-*
- split: validation
path: te/validation-*
- split: test
path: te/test-*
- split: other
path: te/other-*
- config_name: tg
data_files:
- split: train
path: tg/train-*
- split: validation
path: tg/validation-*
- split: test
path: tg/test-*
- config_name: th
data_files:
- split: train
path: th/train-*
- split: validation
path: th/validation-*
- split: test
path: th/test-*
- split: other
path: th/other-*
- config_name: tr
data_files:
- split: train
path: tr/train-*
- split: validation
path: tr/validation-*
- split: test
path: tr/test-*
- split: other
path: tr/other-*
- config_name: uk
data_files:
- split: train
path: uk/train-*
- split: validation
path: uk/validation-*
- split: test
path: uk/test-*
- split: other
path: uk/other-*
- config_name: ur
data_files:
- split: train
path: ur/train-*
- split: validation
path: ur/validation-*
- split: test
path: ur/test-*
- split: other
path: ur/other-*
- config_name: uz
data_files:
- split: train
path: uz/train-*
- split: validation
path: uz/validation-*
- split: test
path: uz/test-*
- split: other
path: uz/other-*
- config_name: vi
data_files:
- split: train
path: vi/train-*
- split: validation
path: vi/validation-*
- split: test
path: vi/test-*
- split: other
path: vi/other-*
- config_name: yo
data_files:
- split: train
path: yo/train-*
- split: validation
path: yo/validation-*
- split: test
path: yo/test-*
- split: other
path: yo/other-*
- config_name: zgh
data_files:
- split: train
path: zgh/train-*
- split: validation
path: zgh/validation-*
- split: test
path: zgh/test-*
- split: other
path: zgh/other-*
- config_name: zu
data_files:
- split: train
path: zu/train-*
- split: test
path: zu/test-*
- split: other
path: zu/other-*
---
## Dataset Statistics
The following table lists, for each `config_name`, the corresponding language and the number of examples in each split.
| config_name | language | train_examples | validation_examples | test_examples | other_examples |
|---|---|---:|---:|---:|---:|
| af | Afrikaans | 139 | 125 | 131 | 306 |
| am | Amharic | 523 | 248 | 252 | 579 |
| ar | Arabic | 28,531 | 10,503 | 10,500 | 41,364 |
| as | Assamese | 952 | 485 | 379 | 2,557 |
| az | Azerbaijani | 157 | 78 | 95 | 529 |
| be | Belarusian | 347,672 | 15,879 | 15,880 | 17,002 |
| bg | Bulgarian | 4,952 | 2,932 | 3,354 | 1,787 |
| bn | Bengali | 21,514 | 9,382 | 9,382 | 999,246 |
| ca | Catalan | 1,208,213 | 16,414 | 16,414 | 223,303 |
| cs | Czech | 21,731 | 9,410 | 9,421 | 149,113 |
| cy | Welsh | 8,014 | 5,408 | 5,408 | 20,676 |
| da | Danish | 3,602 | 2,630 | 2,758 | 2,215 |
| el | Greek | 1,934 | 1,694 | 1,711 | 10,351 |
| en | English | 1,138,760 | 16,400 | 16,400 | 370,671 |
| es | Spanish | 353,701 | 15,893 | 15,893 | 1,142,320 |
| et | Estonian | 3,402 | 2,823 | 2,823 | 107 |
| eu | Basque | 130,043 | 14,753 | 14,753 | 115,423 |
| fa | Persian | 29,789 | 10,676 | 10,676 | 34,503 |
| fi | Finnish | 2,093 | 1,767 | 1,806 | 5,078 |
| fr | French | 593,066 | 16,186 | 16,186 | 18,829 |
| gl | Galician | 70,039 | 13,443 | 13,443 | 153,838 |
| ha | Hausa | 1,908 | 623 | 750 | 6,668 |
| he | Hebrew | 1,011 | 672 | 392 | 2,472 |
| hi | Hindi | 4,869 | 2,700 | 3,343 | 4,449 |
| hu | Hungarian | 39,270 | 11,604 | 11,659 | 50,475 |
| hy-AM | Armenian | 9,303 | 5,859 | 5,823 | 15,157 |
| ig | Igbo | 9 | 3 | 5 | 5,784 |
| is | Icelandic | 17 | 9 | 9 | 25 |
| it | Italian | 172,828 | 15,179 | 15,177 | 17,384 |
| ja | Japanese | 15,425 | 8,004 | 8,004 | 263,563 |
| ka | Georgian | 62,537 | 12,952 | 13,104 | 97,022 |
| kk | Kazakh | 605 | 513 | 536 | 730 |
| kmr | Kurdish (Kurmanji) | 5,277 | 3,999 | 3,991 | 25,416 |
| ko | Korean | 519 | 474 | 472 | 3,813 |
| lo | Lao | 98 | 28 | 26 | 61 |
| lt | Lithuanian | 8,299 | 5,111 | 5,384 | 2,735 |
| lv | Latvian | 14,354 | 7,705 | 7,705 | 21,114 |
| mk | Macedonian | 2,049 | 1,776 | 1,754 | 23,863 |
| ml | Malayalam | 1,235 | 926 | 873 | 5,968 |
| mn | Mongolian | 2,193 | 1,932 | 1,933 | 59,357 |
| mr | Marathi | 2,189 | 1,766 | 1,796 | 2,796 |
| mt | Maltese | 1,910 | 1,625 | 1,660 | 6,288 |
| myv | Erzya | 1,241 | 239 | 481 | — |
| nb-NO | Norwegian Bokmål | 227 | 33 | 116 | 59 |
| ne-NP | Nepali | 353 | 314 | 287 | 362 |
| nl | Dutch | 43,458 | 12,032 | 12,033 | 2,396 |
| nn-NO | Norwegian Nynorsk | 464 | 405 | 423 | 18 |
| oc | Occitan | 304 | 267 | 274 | 7,707 |
| pa-IN | Punjabi | 800 | 406 | 587 | 1,243 |
| pl | Polish | 24,173 | 9,856 | 9,856 | 2,446 |
| ps | Pashto | 4,611 | 3,610 | 3,610 | 41,323 |
| pt | Portuguese | 22,923 | 9,640 | 9,641 | 27,353 |
| ro | Romanian | 5,178 | 3,918 | 3,930 | 23,002 |
| ru | Russian | 26,654 | 10,243 | 10,244 | 17,594 |
| sk | Slovak | 7,354 | 5,007 | 5,053 | 358 |
| sl | Slovenian | 1,469 | 1,331 | 1,340 | 3,409 |
| sq | Albanian | 2,658 | 1,645 | 1,917 | — |
| sr | Serbian | 2,336 | 1,908 | 1,977 | 4,846 |
| sv-SE | Swedish | 8,150 | 5,420 | 5,441 | 6,250 |
| sw | Swahili | 46,534 | 12,253 | 12,256 | 376,852 |
| ta | Tamil | 46,390 | 12,150 | 12,237 | 105,179 |
| te | Telugu | 69 | 67 | 66 | 2,034 |
| tg | Tajik | 123 | 90 | 69 | — |
| th | Thai | 32,959 | 11,057 | 11,057 | 208,030 |
| tr | Turkish | 40,377 | 11,783 | 11,784 | 116 |
| uk | Ukrainian | 26,773 | 10,253 | 10,259 | 8,286 |
| ur | Urdu | 7,326 | 5,082 | 5,082 | 173,382 |
| uz | Uzbek | 48,733 | 12,261 | 12,365 | 128,457 |
| vi | Vietnamese | 2,104 | 931 | 1,367 | 12,568 |
| yo | Yoruba | 1,404 | 913 | 1,113 | 1,156 |
| zgh | Standard Moroccan Tamazight | 842 | 297 | 228 | 648 |
| zu | Zulu | 12 | — | 1 | 106 |
> `—` indicates that the split is not available for that language configuration.
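
Because a few configurations lack a split, a loader that requests all four splits unconditionally would fail for them. The sketch below is one possible way to handle this, not an official loader: it hard-codes the gaps shown in the table (`myv`, `sq`, and `tg` have no `other` split; `zu` has no `validation` split) and loads only the splits that exist. The helper names `available_splits` and `load_config` are hypothetical, and the code assumes the repository id `deepdml/cv22-neucodec` and that the third-party `datasets` package is installed.

```python
from typing import Dict, List, Set

# Repository id as given on the dataset page (assumed to be loadable as-is).
REPO_ID = "deepdml/cv22-neucodec"

# Split order used throughout the card.
ALL_SPLITS = ("train", "validation", "test", "other")

# Splits marked as unavailable in the statistics table.
MISSING_SPLITS: Dict[str, Set[str]] = {
    "myv": {"other"},
    "sq": {"other"},
    "tg": {"other"},
    "zu": {"validation"},
}


def available_splits(config_name: str) -> List[str]:
    """Return the splits listed for `config_name`, in card order.

    Hypothetical helper: it does not check that `config_name` exists,
    so an unknown name simply yields all four splits.
    """
    missing = MISSING_SPLITS.get(config_name, set())
    return [split for split in ALL_SPLITS if split not in missing]


def load_config(config_name: str) -> dict:
    """Load every listed split of one language configuration.

    Needs the `datasets` package and network access; the import is
    deferred so that `available_splits` works without either.
    """
    from datasets import load_dataset

    return {
        split: load_dataset(REPO_ID, config_name, split=split)
        for split in available_splits(config_name)
    }
```

For example, `available_splits("zu")` skips `validation`, so `load_config("zu")` would request only `train`, `test`, and `other`.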
Provider: deepdml



