deepdml/cv17-neucodec
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/deepdml/cv17-neucodec
Resource description:
---
dataset_info:
- config_name: ar
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: language
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 30460953
num_examples: 28369
- name: validation
num_bytes: 11730231
num_examples: 10470
- name: test
num_bytes: 11562939
num_examples: 10480
download_size: 34381352
dataset_size: 53754123
- config_name: ast
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 438071
num_examples: 387
- name: validation
num_bytes: 118621
num_examples: 112
- name: test
num_bytes: 181154
num_examples: 162
download_size: 586891
dataset_size: 737846
- config_name: be
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 434179817
num_examples: 347637
- name: validation
num_bytes: 22501790
num_examples: 15880
- name: test
num_bytes: 22836975
num_examples: 15878
download_size: 292420113
dataset_size: 479518582
- config_name: bg
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 6393276
num_examples: 4849
- name: validation
num_bytes: 3846024
num_examples: 2766
- name: test
num_bytes: 4436383
num_examples: 3201
download_size: 14993529
dataset_size: 14675683
- config_name: bn
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 31585466
num_examples: 21228
- name: validation
num_bytes: 14891785
num_examples: 9327
- name: test
num_bytes: 15134995
num_examples: 9327
download_size: 37958610
dataset_size: 61612246
- config_name: br
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2169476
num_examples: 2663
- name: validation
num_bytes: 1935584
num_examples: 2253
- name: test
num_bytes: 1932533
num_examples: 2212
download_size: 3911259
dataset_size: 6037593
- config_name: cs
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 23832634
num_examples: 20144
- name: validation
num_bytes: 10355091
num_examples: 9009
- name: test
num_bytes: 10390274
num_examples: 9067
download_size: 27656540
dataset_size: 44577999
- config_name: cy
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 10041116
num_examples: 7960
- name: validation
num_bytes: 7128423
num_examples: 5371
- name: test
num_bytes: 7159227
num_examples: 5379
download_size: 15344266
dataset_size: 24328766
- config_name: da
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 3696251
num_examples: 3484
- name: validation
num_bytes: 2527064
num_examples: 2105
- name: test
num_bytes: 2949561
num_examples: 2530
download_size: 5834900
dataset_size: 9172876
- config_name: de
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: language
dtype: string
- name: client_id
dtype: string
splits:
- name: validation
num_bytes: 23713417
num_examples: 16183
- name: test
num_bytes: 23727723
num_examples: 16183
download_size: 33059330
dataset_size: 47441140
- config_name: el
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1993876
num_examples: 1920
- name: validation
num_bytes: 1819076
num_examples: 1700
- name: test
num_bytes: 1869702
num_examples: 1701
download_size: 3626496
dataset_size: 5682654
- config_name: es
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 429304421
num_examples: 336846
- name: validation
num_bytes: 22966767
num_examples: 15857
- name: test
num_bytes: 23148763
num_examples: 15857
download_size: 322134033
dataset_size: 475419951
- config_name: et
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 5176734
num_examples: 3157
- name: validation
num_bytes: 4229451
num_examples: 2653
- name: test
num_bytes: 4299576
num_examples: 2653
download_size: 8731715
dataset_size: 13705761
- config_name: fa
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 28630935
num_examples: 28893
- name: validation
num_bytes: 11372787
num_examples: 10559
- name: test
num_bytes: 12952443
num_examples: 10559
download_size: 33175478
dataset_size: 52956165
- config_name: fi
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2346335
num_examples: 2076
- name: validation
num_bytes: 1957486
num_examples: 1770
- name: test
num_bytes: 2198447
num_examples: 1763
download_size: 4153745
dataset_size: 6502268
- config_name: fr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 709458943
num_examples: 558054
- name: validation
num_bytes: 22694912
num_examples: 16159
- name: test
num_bytes: 22638904
num_examples: 16159
download_size: 517846792
dataset_size: 754792759
- config_name: frold
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 709458943
num_examples: 558054
- name: validation
num_bytes: 22694912
num_examples: 16159
- name: test
num_bytes: 22638904
num_examples: 16159
download_size: 517846792
dataset_size: 754792759
- config_name: gl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: language
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 30191921
num_examples: 25159
- name: validation
num_bytes: 12349179
num_examples: 9982
- name: test
num_bytes: 12741592
num_examples: 9990
download_size: 34316752
dataset_size: 55282692
- config_name: ha
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2056583
num_examples: 1925
- name: validation
num_bytes: 624632
num_examples: 582
- name: test
num_bytes: 766488
num_examples: 661
download_size: 2187911
dataset_size: 3447703
- config_name: hu
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 47045495
num_examples: 37140
- name: validation
num_bytes: 14662203
num_examples: 11350
- name: test
num_bytes: 15444095
num_examples: 11435
download_size: 47404455
dataset_size: 77151793
- config_name: it
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 220420292
num_examples: 169771
- name: validation
num_bytes: 21813816
num_examples: 15149
- name: test
num_bytes: 22647856
num_examples: 15155
download_size: 178513728
dataset_size: 264881964
- config_name: ja
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 12240117
num_examples: 10039
- name: validation
num_bytes: 7524689
num_examples: 6261
- name: test
num_bytes: 7954724
num_examples: 6261
download_size: 18913242
dataset_size: 27719530
- config_name: ka
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 74271471
num_examples: 52321
- name: validation
num_bytes: 18836543
num_examples: 12545
- name: test
num_bytes: 19306286
num_examples: 12618
download_size: 65330945
dataset_size: 112414300
- config_name: ko
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 564763
num_examples: 376
- name: validation
num_bytes: 421171
num_examples: 330
- name: test
num_bytes: 442669
num_examples: 339
download_size: 1112061
dataset_size: 1428603
- config_name: lt
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 8907929
num_examples: 7253
- name: validation
num_bytes: 5628240
num_examples: 4436
- name: test
num_bytes: 6048244
num_examples: 4753
download_size: 12857806
dataset_size: 20584413
- config_name: lv
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 15512334
num_examples: 11364
- name: validation
num_bytes: 8958352
num_examples: 6752
- name: test
num_bytes: 9128277
num_examples: 6752
download_size: 21504444
dataset_size: 33598963
- config_name: mk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1926583
num_examples: 1686
- name: validation
num_bytes: 1421773
num_examples: 1289
- name: test
num_bytes: 1356147
num_examples: 1097
download_size: 2990267
dataset_size: 4704503
- config_name: ml
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1374279
num_examples: 1259
- name: validation
num_bytes: 828425
num_examples: 764
- name: test
num_bytes: 811581
num_examples: 710
download_size: 1881802
dataset_size: 3014285
- config_name: mn
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2889356
num_examples: 2175
- name: validation
num_bytes: 2604959
num_examples: 1870
- name: test
num_bytes: 2740058
num_examples: 1896
download_size: 8500081
dataset_size: 8234373
- config_name: mr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 3399105
num_examples: 2215
- name: validation
num_bytes: 2861426
num_examples: 1780
- name: test
num_bytes: 2779299
num_examples: 1751
download_size: 5511444
dataset_size: 9039830
- config_name: nl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 39461982
num_examples: 34898
- name: validation
num_bytes: 13439810
num_examples: 11252
- name: test
num_bytes: 13535328
num_examples: 11266
download_size: 40969808
dataset_size: 66437120
- config_name: oc
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 332279
num_examples: 271
- name: validation
num_bytes: 297044
num_examples: 260
- name: test
num_bytes: 315989
num_examples: 254
download_size: 779917
dataset_size: 945312
- config_name: pl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 26764463
num_examples: 20729
- name: validation
num_bytes: 11711663
num_examples: 9230
- name: test
num_bytes: 11531426
num_examples: 9230
download_size: 31487788
dataset_size: 50007552
- config_name: pt
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 22736195
num_examples: 21968
- name: validation
num_bytes: 10461579
num_examples: 9464
- name: test
num_bytes: 11081858
num_examples: 9467
download_size: 30608614
dataset_size: 44279632
- config_name: ro
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 5207366
num_examples: 5141
- name: validation
num_bytes: 3880608
num_examples: 3881
- name: test
num_bytes: 4156647
num_examples: 3896
download_size: 8401238
dataset_size: 13244621
- config_name: ru
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 34775873
num_examples: 26377
- name: validation
num_bytes: 13987636
num_examples: 10203
- name: test
num_bytes: 14332084
num_examples: 10203
download_size: 41666792
dataset_size: 63095593
- config_name: sk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 3181108
num_examples: 3258
- name: validation
num_bytes: 2750760
num_examples: 2588
- name: test
num_bytes: 2873529
num_examples: 2647
download_size: 5541916
dataset_size: 8805397
- config_name: sl
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1255596
num_examples: 1388
- name: validation
num_bytes: 1201535
num_examples: 1232
- name: test
num_bytes: 1265382
num_examples: 1242
download_size: 2516994
dataset_size: 3722513
- config_name: sr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 1445854
num_examples: 1879
- name: validation
num_bytes: 1138340
num_examples: 1583
- name: test
num_bytes: 1308490
num_examples: 1539
download_size: 2372149
dataset_size: 3892684
- config_name: sv-SE
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 7904102
num_examples: 7744
- name: validation
num_bytes: 5333517
num_examples: 5210
- name: test
num_bytes: 5927008
num_examples: 5259
download_size: 12141288
dataset_size: 19164627
- config_name: sw
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 61063254
num_examples: 46494
- name: validation
num_bytes: 16573703
num_examples: 12251
- name: test
num_bytes: 16515382
num_examples: 12253
download_size: 58272955
dataset_size: 94152339
- config_name: ta
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 75631388
num_examples: 45587
- name: validation
num_bytes: 18085057
num_examples: 12095
- name: test
num_bytes: 17759604
num_examples: 12074
download_size: 64136902
dataset_size: 111476049
- config_name: te
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 71171
num_examples: 62
- name: validation
num_bytes: 57913
num_examples: 48
- name: test
num_bytes: 54623
num_examples: 49
download_size: 187106
dataset_size: 183707
- config_name: th
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 35673023
num_examples: 32823
- name: validation
num_bytes: 13377168
num_examples: 11042
- name: test
num_bytes: 13807608
num_examples: 11042
download_size: 38409990
dataset_size: 62857799
- config_name: tr
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 32923805
num_examples: 35147
- name: validation
num_bytes: 10407339
num_examples: 11258
- name: test
num_bytes: 11800580
num_examples: 11290
download_size: 34249237
dataset_size: 55131724
- config_name: uk
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 27776846
num_examples: 25137
- name: validation
num_bytes: 12011909
num_examples: 10007
- name: test
num_bytes: 12542689
num_examples: 10011
download_size: 32213244
dataset_size: 52331444
- config_name: ur
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 5820455
num_examples: 5368
- name: validation
num_bytes: 4271693
num_examples: 4057
- name: test
num_bytes: 4752379
num_examples: 4056
download_size: 9389992
dataset_size: 14844527
- config_name: vi
features:
- name: audio_path
dtype: string
- name: duration
dtype: float32
- name: codes
sequence: int32
- name: sentence
dtype: string
- name: client_id
dtype: string
splits:
- name: train
num_bytes: 2580342
num_examples: 2298
- name: validation
num_bytes: 591276
num_examples: 641
- name: test
num_bytes: 1243206
num_examples: 1274
download_size: 2777300
dataset_size: 4414824
configs:
- config_name: ar
data_files:
- split: train
path: ar/train-*
- split: validation
path: ar/validation-*
- split: test
path: ar/test-*
- config_name: ast
data_files:
- split: train
path: ast/train-*
- split: validation
path: ast/validation-*
- split: test
path: ast/test-*
- config_name: be
data_files:
- split: train
path: be/train-*
- split: validation
path: be/validation-*
- split: test
path: be/test-*
- config_name: bg
data_files:
- split: train
path: bg/train-*
- split: validation
path: bg/validation-*
- split: test
path: bg/test-*
- config_name: bn
data_files:
- split: train
path: bn/train-*
- split: validation
path: bn/validation-*
- split: test
path: bn/test-*
- config_name: br
data_files:
- split: train
path: br/train-*
- split: validation
path: br/validation-*
- split: test
path: br/test-*
- config_name: cs
data_files:
- split: train
path: cs/train-*
- split: validation
path: cs/validation-*
- split: test
path: cs/test-*
- config_name: cy
data_files:
- split: train
path: cy/train-*
- split: validation
path: cy/validation-*
- split: test
path: cy/test-*
- config_name: da
data_files:
- split: train
path: da/train-*
- split: validation
path: da/validation-*
- split: test
path: da/test-*
- config_name: de
data_files:
- split: validation
path: de/validation-*
- split: test
path: de/test-*
- config_name: el
data_files:
- split: train
path: el/train-*
- split: validation
path: el/validation-*
- split: test
path: el/test-*
- config_name: es
data_files:
- split: train
path: es/train-*
- split: validation
path: es/validation-*
- split: test
path: es/test-*
- config_name: et
data_files:
- split: train
path: et/train-*
- split: validation
path: et/validation-*
- split: test
path: et/test-*
- config_name: fa
data_files:
- split: train
path: fa/train-*
- split: validation
path: fa/validation-*
- split: test
path: fa/test-*
- config_name: fi
data_files:
- split: train
path: fi/train-*
- split: validation
path: fi/validation-*
- split: test
path: fi/test-*
- config_name: fr
data_files:
- split: train
path: fr/train-*
- split: validation
path: fr/validation-*
- split: test
path: fr/test-*
- config_name: frold
data_files:
- split: train
path: frold/train-*
- split: validation
path: frold/validation-*
- split: test
path: frold/test-*
- config_name: gl
data_files:
- split: train
path: gl/train-*
- split: validation
path: gl/validation-*
- split: test
path: gl/test-*
- config_name: ha
data_files:
- split: train
path: ha/train-*
- split: validation
path: ha/validation-*
- split: test
path: ha/test-*
- config_name: hu
data_files:
- split: train
path: hu/train-*
- split: validation
path: hu/validation-*
- split: test
path: hu/test-*
- config_name: it
data_files:
- split: train
path: it/train-*
- split: validation
path: it/validation-*
- split: test
path: it/test-*
- config_name: ja
data_files:
- split: train
path: ja/train-*
- split: validation
path: ja/validation-*
- split: test
path: ja/test-*
- config_name: ka
data_files:
- split: train
path: ka/train-*
- split: validation
path: ka/validation-*
- split: test
path: ka/test-*
- config_name: ko
data_files:
- split: train
path: ko/train-*
- split: validation
path: ko/validation-*
- split: test
path: ko/test-*
- config_name: lt
data_files:
- split: train
path: lt/train-*
- split: validation
path: lt/validation-*
- split: test
path: lt/test-*
- config_name: lv
data_files:
- split: train
path: lv/train-*
- split: validation
path: lv/validation-*
- split: test
path: lv/test-*
- config_name: mk
data_files:
- split: train
path: mk/train-*
- split: validation
path: mk/validation-*
- split: test
path: mk/test-*
- config_name: ml
data_files:
- split: train
path: ml/train-*
- split: validation
path: ml/validation-*
- split: test
path: ml/test-*
- config_name: mn
data_files:
- split: train
path: mn/train-*
- split: validation
path: mn/validation-*
- split: test
path: mn/test-*
- config_name: mr
data_files:
- split: train
path: mr/train-*
- split: validation
path: mr/validation-*
- split: test
path: mr/test-*
- config_name: nl
data_files:
- split: train
path: nl/train-*
- split: validation
path: nl/validation-*
- split: test
path: nl/test-*
- config_name: oc
data_files:
- split: train
path: oc/train-*
- split: validation
path: oc/validation-*
- split: test
path: oc/test-*
- config_name: pl
data_files:
- split: train
path: pl/train-*
- split: validation
path: pl/validation-*
- split: test
path: pl/test-*
- config_name: pt
data_files:
- split: train
path: pt/train-*
- split: validation
path: pt/validation-*
- split: test
path: pt/test-*
- config_name: ro
data_files:
- split: train
path: ro/train-*
- split: validation
path: ro/validation-*
- split: test
path: ro/test-*
- config_name: ru
data_files:
- split: train
path: ru/train-*
- split: validation
path: ru/validation-*
- split: test
path: ru/test-*
- config_name: sk
data_files:
- split: train
path: sk/train-*
- split: validation
path: sk/validation-*
- split: test
path: sk/test-*
- config_name: sl
data_files:
- split: train
path: sl/train-*
- split: validation
path: sl/validation-*
- split: test
path: sl/test-*
- config_name: sr
data_files:
- split: train
path: sr/train-*
- split: validation
path: sr/validation-*
- split: test
path: sr/test-*
- config_name: sv-SE
data_files:
- split: train
path: sv-SE/train-*
- split: validation
path: sv-SE/validation-*
- split: test
path: sv-SE/test-*
- config_name: sw
data_files:
- split: train
path: sw/train-*
- split: validation
path: sw/validation-*
- split: test
path: sw/test-*
- config_name: ta
data_files:
- split: train
path: ta/train-*
- split: validation
path: ta/validation-*
- split: test
path: ta/test-*
- config_name: te
data_files:
- split: train
path: te/train-*
- split: validation
path: te/validation-*
- split: test
path: te/test-*
- config_name: th
data_files:
- split: train
path: th/train-*
- split: validation
path: th/validation-*
- split: test
path: th/test-*
- config_name: tr
data_files:
- split: train
path: tr/train-*
- split: validation
path: tr/validation-*
- split: test
path: tr/test-*
- config_name: uk
data_files:
- split: train
path: uk/train-*
- split: validation
path: uk/validation-*
- split: test
path: uk/test-*
- config_name: ur
data_files:
- split: train
path: ur/train-*
- split: validation
path: ur/validation-*
- split: test
path: ur/test-*
- config_name: vi
data_files:
- split: train
path: vi/train-*
- split: validation
path: vi/validation-*
- split: test
path: vi/test-*
---
# Dataset
## Dataset Overview
This dataset contains Common Voice speech data encoded into neural codec representations.
Each sample includes:
- `audio_path`
- `duration`
- `codes`
- `sentence`
- `language` (declared in the dataset metadata only for the `ar`, `de`, and `gl` configurations)
- `client_id`
The dataset is organized by language configuration and split into train, validation, and test sets when available.
## Dataset Statistics
The following table summarizes the number of examples for each `config_name`, along with its corresponding `language` and available splits.
| config_name | language | train_examples | validation_examples | test_examples |
|---|---|---:|---:|---:|
| ar | Arabic | 28,369 | 10,470 | 10,480 |
| ast | Asturian | 387 | 112 | 162 |
| be | Belarusian | 347,637 | 15,880 | 15,878 |
| bg | Bulgarian | 4,849 | 2,766 | 3,201 |
| bn | Bengali | 21,228 | 9,327 | 9,327 |
| br | Breton | 2,663 | 2,253 | 2,212 |
| cs | Czech | 20,144 | 9,009 | 9,067 |
| cy | Welsh | 7,960 | 5,371 | 5,379 |
| da | Danish | 3,484 | 2,105 | 2,530 |
| de | German | — | 16,183 | 16,183 |
| el | Greek | 1,920 | 1,700 | 1,701 |
| es | Spanish | 336,846 | 15,857 | 15,857 |
| et | Estonian | 3,157 | 2,653 | 2,653 |
| fa | Persian | 28,893 | 10,559 | 10,559 |
| fi | Finnish | 2,076 | 1,770 | 1,763 |
| fr | French | 558,054 | 16,159 | 16,159 |
| frold | Old French | 558,054 | 16,159 | 16,159 |
| gl | Galician | 25,159 | 9,982 | 9,990 |
| ha | Hausa | 1,925 | 582 | 661 |
| hu | Hungarian | 37,140 | 11,350 | 11,435 |
| it | Italian | 169,771 | 15,149 | 15,155 |
| ja | Japanese | 10,039 | 6,261 | 6,261 |
| ka | Georgian | 52,321 | 12,545 | 12,618 |
| ko | Korean | 376 | 330 | 339 |
| lt | Lithuanian | 7,253 | 4,436 | 4,753 |
| lv | Latvian | 11,364 | 6,752 | 6,752 |
| mk | Macedonian | 1,686 | 1,289 | 1,097 |
| ml | Malayalam | 1,259 | 764 | 710 |
| mn | Mongolian | 2,175 | 1,870 | 1,896 |
| mr | Marathi | 2,215 | 1,780 | 1,751 |
| nl | Dutch | 34,898 | 11,252 | 11,266 |
| oc | Occitan | 271 | 260 | 254 |
| pl | Polish | 20,729 | 9,230 | 9,230 |
| pt | Portuguese | 21,968 | 9,464 | 9,467 |
| ro | Romanian | 5,141 | 3,881 | 3,896 |
| ru | Russian | 26,377 | 10,203 | 10,203 |
| sk | Slovak | 3,258 | 2,588 | 2,647 |
| sl | Slovenian | 1,388 | 1,232 | 1,242 |
| sr | Serbian | 1,879 | 1,583 | 1,539 |
| sv-SE | Swedish | 7,744 | 5,210 | 5,259 |
| sw | Swahili | 46,494 | 12,251 | 12,253 |
| ta | Tamil | 45,587 | 12,095 | 12,074 |
| te | Telugu | 62 | 48 | 49 |
| th | Thai | 32,823 | 11,042 | 11,042 |
| tr | Turkish | 35,147 | 11,258 | 11,290 |
| uk | Ukrainian | 25,137 | 10,007 | 10,011 |
| ur | Urdu | 5,368 | 4,057 | 4,056 |
| vi | Vietnamese | 2,298 | 641 | 1,274 |
### Notes
- Most configurations include `train`, `validation`, and `test` splits.
- `de` currently includes only `validation` and `test` splits in the dataset metadata.
- `frold` reports the same example counts, byte sizes, and download size as `fr` in the dataset metadata.
- The `language` column in the table above gives a readable language name for each configuration; it is separate from the `language` feature stored in the data.
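Since not every configuration lists all three splits, it can help to tabulate which splits exist before choosing one. The sketch below uses a few rows copied from the table above; the `SPLIT_COUNTS` dictionary and both helper names are illustrative and not part of the dataset.

```python
# Example counts copied from the statistics table (a small subset only).
# None marks a split that the dataset metadata does not list.
SPLIT_COUNTS = {
    "ar": {"train": 28369, "validation": 10470, "test": 10480},
    "de": {"train": None, "validation": 16183, "test": 16183},
    "te": {"train": 62, "validation": 48, "test": 49},
}


def available_splits(config_name):
    """Return the split names that have a count for this configuration."""
    return [s for s, n in SPLIT_COUNTS[config_name].items() if n is not None]


def total_examples(config_name):
    """Sum the example counts over the splits that exist."""
    return sum(n for n in SPLIT_COUNTS[config_name].values() if n is not None)


for name in SPLIT_COUNTS:
    print(name, available_splits(name), total_examples(name))
```

Very small configurations such as `te` may be too small to train on alone, so checking the totals first is a cheap guard.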
## Features
- `audio_path` (`string`): path to the audio sample
- `duration` (`float32`): audio duration in seconds
- `codes` (`sequence[int32]`): neural codec token sequence
- `sentence` (`string`): transcription text
- `language` (`string`): language code; declared in the dataset metadata only for the `ar`, `de`, and `gl` configurations
- `client_id` (`string`): speaker/client identifier
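As a rough illustration of how these fields fit together, the sketch below summarizes one record: it counts the codec tokens, relates that count to the clip duration, and falls back to the configuration name when a record has no `language` field. The `describe_record` helper and the sample record are hypothetical; the values are made up and do not reflect the real token rate of the codec.

```python
def describe_record(record, config_name):
    """Summarize one sample using the fields listed above."""
    codes = record["codes"]
    duration = float(record["duration"])
    return {
        "num_codes": len(codes),
        # Guard against zero-length clips to avoid dividing by zero.
        "codes_per_second": len(codes) / duration if duration > 0 else None,
        # Fall back to the configuration name when `language` is absent.
        "language": record.get("language", config_name),
    }


# Hypothetical record with invented values, shaped like the feature list.
example = {
    "audio_path": "clip_0001.mp3",
    "duration": 2.0,
    "codes": [101, 57, 930, 12],
    "sentence": "hello",
    "client_id": "abc",
}
print(describe_record(example, "fr"))
```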
## Usage
```python
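from datasets import load_dataset

# Load every available split of the Arabic ("ar") configuration.
# The repository ID matches the dataset name and download link above.
dataset = load_dataset("deepdml/cv17-neucodec", "ar")
print(dataset)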
```
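Because some configurations (for example `de`) have no `train` split in the metadata, code that hard-codes `dataset["train"]` may fail. One possible workaround, sketched here, is to pick the first split that exists. The helper names `pick_split` and `load_first_available` are illustrative, and the default repository ID is assumed from the dataset name at the top of this page.

```python
def pick_split(available, preferred=("train", "validation", "test")):
    """Return the first preferred split name that is actually available."""
    for name in preferred:
        if name in available:
            return name
    raise ValueError("no usable split found")


def load_first_available(config_name, repo_id="deepdml/cv17-neucodec"):
    """Load one configuration and return (split_name, split_data)."""
    # Imported lazily so pick_split can be used without the library installed.
    from datasets import load_dataset

    # Without a `split` argument, load_dataset returns a dict-like
    # object keyed by split name.
    dataset = load_dataset(repo_id, config_name)
    split = pick_split(list(dataset.keys()))
    return split, dataset[split]


# Example call (downloads data, so it is left commented out):
# split, data = load_first_available("de")
# print(split, data[0]["sentence"])
```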
Provider:
deepdml



