
deepdml/cv17-neucodec

Hugging Face | Updated 2026-04-08 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/deepdml/cv17-neucodec
Resource description:
--- dataset_info: - config_name: ar features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: language dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 30460953 num_examples: 28369 - name: validation num_bytes: 11730231 num_examples: 10470 - name: test num_bytes: 11562939 num_examples: 10480 download_size: 34381352 dataset_size: 53754123 - config_name: ast features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 438071 num_examples: 387 - name: validation num_bytes: 118621 num_examples: 112 - name: test num_bytes: 181154 num_examples: 162 download_size: 586891 dataset_size: 737846 - config_name: be features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 434179817 num_examples: 347637 - name: validation num_bytes: 22501790 num_examples: 15880 - name: test num_bytes: 22836975 num_examples: 15878 download_size: 292420113 dataset_size: 479518582 - config_name: bg features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 6393276 num_examples: 4849 - name: validation num_bytes: 3846024 num_examples: 2766 - name: test num_bytes: 4436383 num_examples: 3201 download_size: 14993529 dataset_size: 14675683 - config_name: bn features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 31585466 num_examples: 21228 - name: validation num_bytes: 14891785 num_examples: 9327 - name: test num_bytes: 15134995 
num_examples: 9327 download_size: 37958610 dataset_size: 61612246 - config_name: br features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 2169476 num_examples: 2663 - name: validation num_bytes: 1935584 num_examples: 2253 - name: test num_bytes: 1932533 num_examples: 2212 download_size: 3911259 dataset_size: 6037593 - config_name: cs features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 23832634 num_examples: 20144 - name: validation num_bytes: 10355091 num_examples: 9009 - name: test num_bytes: 10390274 num_examples: 9067 download_size: 27656540 dataset_size: 44577999 - config_name: cy features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 10041116 num_examples: 7960 - name: validation num_bytes: 7128423 num_examples: 5371 - name: test num_bytes: 7159227 num_examples: 5379 download_size: 15344266 dataset_size: 24328766 - config_name: da features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 3696251 num_examples: 3484 - name: validation num_bytes: 2527064 num_examples: 2105 - name: test num_bytes: 2949561 num_examples: 2530 download_size: 5834900 dataset_size: 9172876 - config_name: de features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: language dtype: string - name: client_id dtype: string splits: - name: validation num_bytes: 23713417 num_examples: 16183 - name: test num_bytes: 23727723 
num_examples: 16183 download_size: 33059330 dataset_size: 47441140 - config_name: el features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 1993876 num_examples: 1920 - name: validation num_bytes: 1819076 num_examples: 1700 - name: test num_bytes: 1869702 num_examples: 1701 download_size: 3626496 dataset_size: 5682654 - config_name: es features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 429304421 num_examples: 336846 - name: validation num_bytes: 22966767 num_examples: 15857 - name: test num_bytes: 23148763 num_examples: 15857 download_size: 322134033 dataset_size: 475419951 - config_name: et features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 5176734 num_examples: 3157 - name: validation num_bytes: 4229451 num_examples: 2653 - name: test num_bytes: 4299576 num_examples: 2653 download_size: 8731715 dataset_size: 13705761 - config_name: fa features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 28630935 num_examples: 28893 - name: validation num_bytes: 11372787 num_examples: 10559 - name: test num_bytes: 12952443 num_examples: 10559 download_size: 33175478 dataset_size: 52956165 - config_name: fi features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 2346335 num_examples: 2076 - name: validation num_bytes: 1957486 num_examples: 1770 - name: test 
num_bytes: 2198447 num_examples: 1763 download_size: 4153745 dataset_size: 6502268 - config_name: fr features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 709458943 num_examples: 558054 - name: validation num_bytes: 22694912 num_examples: 16159 - name: test num_bytes: 22638904 num_examples: 16159 download_size: 517846792 dataset_size: 754792759 - config_name: frold features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 709458943 num_examples: 558054 - name: validation num_bytes: 22694912 num_examples: 16159 - name: test num_bytes: 22638904 num_examples: 16159 download_size: 517846792 dataset_size: 754792759 - config_name: gl features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: language dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 30191921 num_examples: 25159 - name: validation num_bytes: 12349179 num_examples: 9982 - name: test num_bytes: 12741592 num_examples: 9990 download_size: 34316752 dataset_size: 55282692 - config_name: ha features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 2056583 num_examples: 1925 - name: validation num_bytes: 624632 num_examples: 582 - name: test num_bytes: 766488 num_examples: 661 download_size: 2187911 dataset_size: 3447703 - config_name: hu features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 47045495 num_examples: 37140 - name: 
validation num_bytes: 14662203 num_examples: 11350 - name: test num_bytes: 15444095 num_examples: 11435 download_size: 47404455 dataset_size: 77151793 - config_name: it features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 220420292 num_examples: 169771 - name: validation num_bytes: 21813816 num_examples: 15149 - name: test num_bytes: 22647856 num_examples: 15155 download_size: 178513728 dataset_size: 264881964 - config_name: ja features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 12240117 num_examples: 10039 - name: validation num_bytes: 7524689 num_examples: 6261 - name: test num_bytes: 7954724 num_examples: 6261 download_size: 18913242 dataset_size: 27719530 - config_name: ka features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 74271471 num_examples: 52321 - name: validation num_bytes: 18836543 num_examples: 12545 - name: test num_bytes: 19306286 num_examples: 12618 download_size: 65330945 dataset_size: 112414300 - config_name: ko features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 564763 num_examples: 376 - name: validation num_bytes: 421171 num_examples: 330 - name: test num_bytes: 442669 num_examples: 339 download_size: 1112061 dataset_size: 1428603 - config_name: lt features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 8907929 
num_examples: 7253 - name: validation num_bytes: 5628240 num_examples: 4436 - name: test num_bytes: 6048244 num_examples: 4753 download_size: 12857806 dataset_size: 20584413 - config_name: lv features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 15512334 num_examples: 11364 - name: validation num_bytes: 8958352 num_examples: 6752 - name: test num_bytes: 9128277 num_examples: 6752 download_size: 21504444 dataset_size: 33598963 - config_name: mk features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 1926583 num_examples: 1686 - name: validation num_bytes: 1421773 num_examples: 1289 - name: test num_bytes: 1356147 num_examples: 1097 download_size: 2990267 dataset_size: 4704503 - config_name: ml features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 1374279 num_examples: 1259 - name: validation num_bytes: 828425 num_examples: 764 - name: test num_bytes: 811581 num_examples: 710 download_size: 1881802 dataset_size: 3014285 - config_name: mn features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 2889356 num_examples: 2175 - name: validation num_bytes: 2604959 num_examples: 1870 - name: test num_bytes: 2740058 num_examples: 1896 download_size: 8500081 dataset_size: 8234373 - config_name: mr features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 3399105 
num_examples: 2215 - name: validation num_bytes: 2861426 num_examples: 1780 - name: test num_bytes: 2779299 num_examples: 1751 download_size: 5511444 dataset_size: 9039830 - config_name: nl features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 39461982 num_examples: 34898 - name: validation num_bytes: 13439810 num_examples: 11252 - name: test num_bytes: 13535328 num_examples: 11266 download_size: 40969808 dataset_size: 66437120 - config_name: oc features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 332279 num_examples: 271 - name: validation num_bytes: 297044 num_examples: 260 - name: test num_bytes: 315989 num_examples: 254 download_size: 779917 dataset_size: 945312 - config_name: pl features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 26764463 num_examples: 20729 - name: validation num_bytes: 11711663 num_examples: 9230 - name: test num_bytes: 11531426 num_examples: 9230 download_size: 31487788 dataset_size: 50007552 - config_name: pt features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 22736195 num_examples: 21968 - name: validation num_bytes: 10461579 num_examples: 9464 - name: test num_bytes: 11081858 num_examples: 9467 download_size: 30608614 dataset_size: 44279632 - config_name: ro features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 
5207366 num_examples: 5141 - name: validation num_bytes: 3880608 num_examples: 3881 - name: test num_bytes: 4156647 num_examples: 3896 download_size: 8401238 dataset_size: 13244621 - config_name: ru features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 34775873 num_examples: 26377 - name: validation num_bytes: 13987636 num_examples: 10203 - name: test num_bytes: 14332084 num_examples: 10203 download_size: 41666792 dataset_size: 63095593 - config_name: sk features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 3181108 num_examples: 3258 - name: validation num_bytes: 2750760 num_examples: 2588 - name: test num_bytes: 2873529 num_examples: 2647 download_size: 5541916 dataset_size: 8805397 - config_name: sl features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 1255596 num_examples: 1388 - name: validation num_bytes: 1201535 num_examples: 1232 - name: test num_bytes: 1265382 num_examples: 1242 download_size: 2516994 dataset_size: 3722513 - config_name: sr features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 1445854 num_examples: 1879 - name: validation num_bytes: 1138340 num_examples: 1583 - name: test num_bytes: 1308490 num_examples: 1539 download_size: 2372149 dataset_size: 3892684 - config_name: sv-SE features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train 
num_bytes: 7904102 num_examples: 7744 - name: validation num_bytes: 5333517 num_examples: 5210 - name: test num_bytes: 5927008 num_examples: 5259 download_size: 12141288 dataset_size: 19164627 - config_name: sw features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 61063254 num_examples: 46494 - name: validation num_bytes: 16573703 num_examples: 12251 - name: test num_bytes: 16515382 num_examples: 12253 download_size: 58272955 dataset_size: 94152339 - config_name: ta features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 75631388 num_examples: 45587 - name: validation num_bytes: 18085057 num_examples: 12095 - name: test num_bytes: 17759604 num_examples: 12074 download_size: 64136902 dataset_size: 111476049 - config_name: te features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 71171 num_examples: 62 - name: validation num_bytes: 57913 num_examples: 48 - name: test num_bytes: 54623 num_examples: 49 download_size: 187106 dataset_size: 183707 - config_name: th features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 35673023 num_examples: 32823 - name: validation num_bytes: 13377168 num_examples: 11042 - name: test num_bytes: 13807608 num_examples: 11042 download_size: 38409990 dataset_size: 62857799 - config_name: tr features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - 
name: train num_bytes: 32923805 num_examples: 35147 - name: validation num_bytes: 10407339 num_examples: 11258 - name: test num_bytes: 11800580 num_examples: 11290 download_size: 34249237 dataset_size: 55131724 - config_name: uk features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 27776846 num_examples: 25137 - name: validation num_bytes: 12011909 num_examples: 10007 - name: test num_bytes: 12542689 num_examples: 10011 download_size: 32213244 dataset_size: 52331444 - config_name: ur features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 5820455 num_examples: 5368 - name: validation num_bytes: 4271693 num_examples: 4057 - name: test num_bytes: 4752379 num_examples: 4056 download_size: 9389992 dataset_size: 14844527 - config_name: vi features: - name: audio_path dtype: string - name: duration dtype: float32 - name: codes sequence: int32 - name: sentence dtype: string - name: client_id dtype: string splits: - name: train num_bytes: 2580342 num_examples: 2298 - name: validation num_bytes: 591276 num_examples: 641 - name: test num_bytes: 1243206 num_examples: 1274 download_size: 2777300 dataset_size: 4414824 configs: - config_name: ar data_files: - split: train path: ar/train-* - split: validation path: ar/validation-* - split: test path: ar/test-* - config_name: ast data_files: - split: train path: ast/train-* - split: validation path: ast/validation-* - split: test path: ast/test-* - config_name: be data_files: - split: train path: be/train-* - split: validation path: be/validation-* - split: test path: be/test-* - config_name: bg data_files: - split: train path: bg/train-* - split: validation path: bg/validation-* - split: test path: bg/test-* - config_name: bn data_files: - 
split: train path: bn/train-* - split: validation path: bn/validation-* - split: test path: bn/test-* - config_name: br data_files: - split: train path: br/train-* - split: validation path: br/validation-* - split: test path: br/test-* - config_name: cs data_files: - split: train path: cs/train-* - split: validation path: cs/validation-* - split: test path: cs/test-* - config_name: cy data_files: - split: train path: cy/train-* - split: validation path: cy/validation-* - split: test path: cy/test-* - config_name: da data_files: - split: train path: da/train-* - split: validation path: da/validation-* - split: test path: da/test-* - config_name: de data_files: - split: validation path: de/validation-* - split: test path: de/test-* - config_name: el data_files: - split: train path: el/train-* - split: validation path: el/validation-* - split: test path: el/test-* - config_name: es data_files: - split: train path: es/train-* - split: validation path: es/validation-* - split: test path: es/test-* - config_name: et data_files: - split: train path: et/train-* - split: validation path: et/validation-* - split: test path: et/test-* - config_name: fa data_files: - split: train path: fa/train-* - split: validation path: fa/validation-* - split: test path: fa/test-* - config_name: fi data_files: - split: train path: fi/train-* - split: validation path: fi/validation-* - split: test path: fi/test-* - config_name: fr data_files: - split: train path: fr/train-* - split: validation path: fr/validation-* - split: test path: fr/test-* - config_name: frold data_files: - split: train path: frold/train-* - split: validation path: frold/validation-* - split: test path: frold/test-* - config_name: gl data_files: - split: train path: gl/train-* - split: validation path: gl/validation-* - split: test path: gl/test-* - config_name: ha data_files: - split: train path: ha/train-* - split: validation path: ha/validation-* - split: test path: ha/test-* - config_name: hu data_files: - split: 
train path: hu/train-* - split: validation path: hu/validation-* - split: test path: hu/test-* - config_name: it data_files: - split: train path: it/train-* - split: validation path: it/validation-* - split: test path: it/test-* - config_name: ja data_files: - split: train path: ja/train-* - split: validation path: ja/validation-* - split: test path: ja/test-* - config_name: ka data_files: - split: train path: ka/train-* - split: validation path: ka/validation-* - split: test path: ka/test-* - config_name: ko data_files: - split: train path: ko/train-* - split: validation path: ko/validation-* - split: test path: ko/test-* - config_name: lt data_files: - split: train path: lt/train-* - split: validation path: lt/validation-* - split: test path: lt/test-* - config_name: lv data_files: - split: train path: lv/train-* - split: validation path: lv/validation-* - split: test path: lv/test-* - config_name: mk data_files: - split: train path: mk/train-* - split: validation path: mk/validation-* - split: test path: mk/test-* - config_name: ml data_files: - split: train path: ml/train-* - split: validation path: ml/validation-* - split: test path: ml/test-* - config_name: mn data_files: - split: train path: mn/train-* - split: validation path: mn/validation-* - split: test path: mn/test-* - config_name: mr data_files: - split: train path: mr/train-* - split: validation path: mr/validation-* - split: test path: mr/test-* - config_name: nl data_files: - split: train path: nl/train-* - split: validation path: nl/validation-* - split: test path: nl/test-* - config_name: oc data_files: - split: train path: oc/train-* - split: validation path: oc/validation-* - split: test path: oc/test-* - config_name: pl data_files: - split: train path: pl/train-* - split: validation path: pl/validation-* - split: test path: pl/test-* - config_name: pt data_files: - split: train path: pt/train-* - split: validation path: pt/validation-* - split: test path: pt/test-* - config_name: ro 
data_files: - split: train path: ro/train-* - split: validation path: ro/validation-* - split: test path: ro/test-* - config_name: ru data_files: - split: train path: ru/train-* - split: validation path: ru/validation-* - split: test path: ru/test-* - config_name: sk data_files: - split: train path: sk/train-* - split: validation path: sk/validation-* - split: test path: sk/test-* - config_name: sl data_files: - split: train path: sl/train-* - split: validation path: sl/validation-* - split: test path: sl/test-* - config_name: sr data_files: - split: train path: sr/train-* - split: validation path: sr/validation-* - split: test path: sr/test-* - config_name: sv-SE data_files: - split: train path: sv-SE/train-* - split: validation path: sv-SE/validation-* - split: test path: sv-SE/test-* - config_name: sw data_files: - split: train path: sw/train-* - split: validation path: sw/validation-* - split: test path: sw/test-* - config_name: ta data_files: - split: train path: ta/train-* - split: validation path: ta/validation-* - split: test path: ta/test-* - config_name: te data_files: - split: train path: te/train-* - split: validation path: te/validation-* - split: test path: te/test-* - config_name: th data_files: - split: train path: th/train-* - split: validation path: th/validation-* - split: test path: th/test-* - config_name: tr data_files: - split: train path: tr/train-* - split: validation path: tr/validation-* - split: test path: tr/test-* - config_name: uk data_files: - split: train path: uk/train-* - split: validation path: uk/validation-* - split: test path: uk/test-* - config_name: ur data_files: - split: train path: ur/train-* - split: validation path: ur/validation-* - split: test path: ur/test-* - config_name: vi data_files: - split: train path: vi/train-* - split: validation path: vi/validation-* - split: test path: vi/test-*
---

# Dataset

## Dataset Overview

This dataset contains Common Voice speech data encoded into neural codec representations.
Each sample includes:

- `audio_path`
- `duration`
- `codes`
- `sentence`
- `language` (declared in the metadata only for the `ar`, `de`, and `gl` configurations)
- `client_id`

The dataset is organized by language configuration and split into train, validation, and test sets when available.

## Dataset Statistics

The following table summarizes the number of examples for each `config_name`, along with its corresponding `language` and available splits.

| config_name | language | train_examples | validation_examples | test_examples |
|---|---|---:|---:|---:|
| ar | Arabic | 28,369 | 10,470 | 10,480 |
| ast | Asturian | 387 | 112 | 162 |
| be | Belarusian | 347,637 | 15,880 | 15,878 |
| bg | Bulgarian | 4,849 | 2,766 | 3,201 |
| bn | Bengali | 21,228 | 9,327 | 9,327 |
| br | Breton | 2,663 | 2,253 | 2,212 |
| cs | Czech | 20,144 | 9,009 | 9,067 |
| cy | Welsh | 7,960 | 5,371 | 5,379 |
| da | Danish | 3,484 | 2,105 | 2,530 |
| de | German | n/a | 16,183 | 16,183 |
| el | Greek | 1,920 | 1,700 | 1,701 |
| es | Spanish | 336,846 | 15,857 | 15,857 |
| et | Estonian | 3,157 | 2,653 | 2,653 |
| fa | Persian | 28,893 | 10,559 | 10,559 |
| fi | Finnish | 2,076 | 1,770 | 1,763 |
| fr | French | 558,054 | 16,159 | 16,159 |
| frold | Old French | 558,054 | 16,159 | 16,159 |
| gl | Galician | 25,159 | 9,982 | 9,990 |
| ha | Hausa | 1,925 | 582 | 661 |
| hu | Hungarian | 37,140 | 11,350 | 11,435 |
| it | Italian | 169,771 | 15,149 | 15,155 |
| ja | Japanese | 10,039 | 6,261 | 6,261 |
| ka | Georgian | 52,321 | 12,545 | 12,618 |
| ko | Korean | 376 | 330 | 339 |
| lt | Lithuanian | 7,253 | 4,436 | 4,753 |
| lv | Latvian | 11,364 | 6,752 | 6,752 |
| mk | Macedonian | 1,686 | 1,289 | 1,097 |
| ml | Malayalam | 1,259 | 764 | 710 |
| mn | Mongolian | 2,175 | 1,870 | 1,896 |
| mr | Marathi | 2,215 | 1,780 | 1,751 |
| nl | Dutch | 34,898 | 11,252 | 11,266 |
| oc | Occitan | 271 | 260 | 254 |
| pl | Polish | 20,729 | 9,230 | 9,230 |
| pt | Portuguese | 21,968 | 9,464 | 9,467 |
| ro | Romanian | 5,141 | 3,881 | 3,896 |
| ru | Russian | 26,377 | 10,203 | 10,203 |
| sk | Slovak | 3,258 | 2,588 | 2,647 |
| sl | Slovenian | 1,388 | 1,232 | 1,242 |
| sr | Serbian | 1,879 | 1,583 | 1,539 |
| sv-SE | Swedish | 7,744 | 5,210 | 5,259 |
| sw | Swahili | 46,494 | 12,251 | 12,253 |
| ta | Tamil | 45,587 | 12,095 | 12,074 |
| te | Telugu | 62 | 48 | 49 |
| th | Thai | 32,823 | 11,042 | 11,042 |
| tr | Turkish | 35,147 | 11,258 | 11,290 |
| uk | Ukrainian | 25,137 | 10,007 | 10,011 |
| ur | Urdu | 5,368 | 4,057 | 4,056 |
| vi | Vietnamese | 2,298 | 641 | 1,274 |

### Notes

- Most configurations include `train`, `validation`, and `test` splits.
- `de` currently includes only `validation` and `test` splits in the dataset metadata.
- `frold` lists the same example counts and byte sizes as `fr` in the dataset metadata.
- The `language` column in the table above provides a readable language name for each dataset configuration.

## Features

- `audio_path` (`string`): path to the audio sample
- `duration` (`float32`): audio duration in seconds
- `codes` (`sequence[int32]`): neural codec token sequence
- `sentence` (`string`): transcription text
- `language` (`string`): language code
- `client_id` (`string`): speaker/client identifier

## Usage

```python
from datasets import load_dataset

# Load one language configuration by its config_name (here "ar", Arabic).
# The repository id matches this page's title and download link.
dataset = load_dataset("deepdml/cv17-neucodec", "ar")
print(dataset)
```
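The feature list and notes above suggest two practical details: the `language` column is not declared for every configuration, and `de` has no `train` split. The following is a minimal sketch of how a record could be inspected with that in mind. The helper names `summarize_sample` and `pick_split` are hypothetical, and the sample record is synthetic (made-up values, not taken from the dataset), so the block runs without downloading anything.

```python
from typing import Any, Dict, Iterable, Optional, Sequence

# Columns named in the card's feature list; `language` is treated as optional
# because the metadata does not declare it for every configuration.
REQUIRED_COLUMNS = ("audio_path", "duration", "codes", "sentence", "client_id")


def summarize_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize one record, tolerating a missing `language` column."""
    missing = [name for name in REQUIRED_COLUMNS if name not in sample]
    if missing:
        raise KeyError(f"sample is missing columns: {missing}")
    duration = float(sample["duration"])
    num_codes = len(sample["codes"])
    codes_per_second: Optional[float] = num_codes / duration if duration > 0 else None
    return {
        "sentence": sample["sentence"],
        "language": sample.get("language"),  # None when the column is absent
        "num_codes": num_codes,
        "codes_per_second": codes_per_second,  # None for zero-length clips
    }


def pick_split(
    available: Iterable[str],
    preferred: Sequence[str] = ("train", "validation", "test"),
) -> str:
    """Return the first preferred split that exists (e.g. `de` has no `train`)."""
    names = set(available)
    for name in preferred:
        if name in names:
            return name
    raise ValueError("none of the preferred splits is available")


# Synthetic record for illustration only; every value here is invented.
example = {
    "audio_path": "clips/example.mp3",
    "duration": 2.0,
    "codes": list(range(100)),
    "sentence": "hello world",
    "client_id": "speaker-0",
}
summary = summarize_sample(example)
print(summary)
print(pick_split(["validation", "test"]))
```

With real data, the same helpers could be applied to the object returned by `load_dataset`, for example `split = pick_split(dataset.keys())` followed by `summarize_sample(dataset[split][0])`. The number of codes per second depends on the codec used to produce `codes`, which the card does not specify, so the ratio is computed rather than assumed.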
Provider:
deepdml