deepdml/cv22-neucodec

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/deepdml/cv22-neucodec
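A minimal sketch of loading one language config of this dataset with the Hugging Face `datasets` library. The config name `af`, the split `train`, and the `<config>/<split>-*` shard pattern come from the dataset metadata on this page. The helper names `split_pattern` and `load_split` are hypothetical, and routing requests to the mirror through the `HF_ENDPOINT` environment variable is an assumption about the local setup, not something this page states.

```python
import os

REPO_ID = "deepdml/cv22-neucodec"  # dataset name shown on this page
MIRROR = "https://hf-mirror.com"   # host of the download link above


def split_pattern(config: str, split: str) -> str:
    """Return the shard glob used in the card's data_files entries, e.g. 'af/train-*'."""
    return f"{config}/{split}-*"


def load_split(config: str = "af", split: str = "train"):
    """Load one config and split. Needs network access and `pip install datasets`."""
    # Assumption: the mirror is selected via HF_ENDPOINT, which has to be set
    # before the Hugging Face libraries are imported for the first time.
    os.environ.setdefault("HF_ENDPOINT", MIRROR)
    from datasets import load_dataset  # lazy import keeps split_pattern usable offline

    return load_dataset(REPO_ID, config, split=split)
```

Calling `load_split("af", "train")` should return a dataset whose rows carry the fields listed in the metadata (`audio_path`, `duration`, `codes`, `sentence`, `client_id`). The smallest configs, such as `is` or `zu`, are convenient for a first smoke test.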
Resource description:
dataset_info: every config (config_name) declares the same features: audio_path (dtype: string), duration (dtype: float32), codes (sequence: int32), sentence (dtype: string), client_id (dtype: string).

Each split cell below gives num_bytes / num_examples. An empty cell means the split is not listed for that config.

| config_name | train | validation | test | other | download_size | dataset_size |
|---|---|---|---|---|---|---|
| af | 234215 / 139 | 210750 / 125 | 211704 / 131 | 530539 / 306 | 2916207 | 1187208 |
| am | 862126 / 523 | 388373 / 248 | 422647 / 252 | 1036563 / 579 | 4110632 | 2709709 |
| ar | 34082906 / 28531 | 13185327 / 10503 | 12850962 / 10500 | 49125511 / 41364 | 121907594 | 109244706 |
| as | 1496770 / 952 | 710085 / 485 | 596191 / 379 | 4192940 / 2557 | 7931300 | 6995986 |
| az | 221414 / 157 | 100869 / 78 | 150855 / 95 | 774746 / 529 | 1825196 | 1247884 |
| be | 479701676 / 347672 | 24574704 / 15879 | 24836588 / 15880 | 26024040 / 17002 | 311423647 | 555137008 |
| bg | 7168900 / 4952 | 4457729 / 2932 | 5084572 / 3354 | 2727148 / 1787 | 10882523 | 19438349 |
| bn | 34739997 / 21514 | 16186658 / 9382 | 16428595 / 9382 | 1303726999 / 999246 | 741753000 | 1371082249 |
| ca | 1746515510 / 1208213 | 25973496 / 16414 | 25874994 / 16414 | 273541359 / 223303 | 1161634677 | 2071905359 |
| cs | 28603672 / 21731 | 11983611 / 9410 | 12023840 / 9421 | 190308760 / 149113 | 136434186 | 242919883 |
| cy | 11146324 / 8014 | 7901797 / 5408 | 7896561 / 5408 | 29824233 / 20676 | 32546870 | 56768915 |
| da | 4282783 / 3602 | 3388363 / 2630 | 3537954 / 2758 | 2558218 / 2215 | 7839424 | 13767318 |
| el | 2287740 / 1934 | 2001629 / 1694 | 2110810 / 1711 | 12622896 / 10351 | 10675779 | 19023075 |
| en | 1713024261 / 1138760 | 25491746 / 16400 | 25384592 / 16400 | 537774106 / 370671 | 1313672547 | 2301674705 |
| es | 495961511 / 353701 | 25061639 / 15893 | 25222726 / 15893 | 1513520844 / 1142320 | 1161449826 | 2059766720 |
| et | 5975326 / 3402 | 4847996 / 2823 | 4894333 / 2823 | 163980 / 107 | 9461859 | 15881635 |
| eu | 195688979 / 130043 | 23251143 / 14753 | 23434069 / 14753 | 173505583 / 115423 | 238311864 | 415879774 |
| fa | 33310522 / 29789 | 12875044 / 10676 | 14427351 / 10676 | 37098112 / 34503 | 54844523 | 97711029 |
| fi | 2598672 / 2093 | 2249772 / 1767 | 2487415 / 1806 | 6610991 / 5078 | 7995992 | 13946850 |
| fr | 831402083 / 593066 | 24707701 / 16186 | 24865963 / 16186 | 27765612 / 18829 | 515858563 | 908741359 |
| gl | 94575749 / 70039 | 18366955 / 13443 | 19035550 / 13443 | 206135327 / 153838 | 192249088 | 338113581 |
| ha | 2287572 / 1908 | 749972 / 623 | 960260 / 750 | 8102486 / 6668 | 6848866 | 12100290 |
| he | 1254839 / 1011 | 894902 / 672 | 569890 / 392 | 3191039 / 2472 | 3426207 | 5910670 |
| hi | 6261214 / 4869 | 3687146 / 2700 | 4794833 / 3343 | 7121601 / 4449 | 12271651 | 21864794 |
| hu | 55012325 / 39270 | 16466797 / 11604 | 17257348 / 11659 | 77488850 / 50475 | 94092317 | 166225320 |
| hy-AM | 13638764 / 9303 | 8758504 / 5859 | 9039088 / 5823 | 22530952 / 15157 | 30286498 | 53967308 |
| ig | 13084 / 9 | 3940 / 3 | 7050 / 5 | 8379738 / 5784 | 4800403 | 8403812 |
| is | 30244 / 17 | 15662 / 9 | 16554 / 9 | 45828 / 25 | 124588 | 108288 |
| it | 247018863 / 172828 | 23842217 / 15179 | 24633330 / 15177 | 27060725 / 17384 | 183594875 | 322555135 |
| ja | 20206831 / 15425 | 10081711 / 8004 | 9929705 / 8004 | 326983657 / 263563 | 204128063 | 367201904 |
| ka | 96846440 / 62537 | 21317753 / 12952 | 21620444 / 13104 | 149411577 / 97022 | 156957082 | 289196214 |
| kk | 814528 / 605 | 675867 / 513 | 743913 / 536 | 1010553 / 730 | 2037985 | 3244861 |
| kmr | 6008264 / 5277 | 4741059 / 3999 | 5067586 / 3991 | 28620126 / 25416 | 24337124 | 44437035 |
| ko | 851454 / 519 | 673691 / 474 | 657580 / 472 | 5261817 / 3813 | 4461063 | 7444542 |
| lo | 171277 / 98 | 51859 / 28 | 48073 / 26 | 108928 / 61 | 347529 | 380137 |
| lt | 11440900 / 8299 | 6918684 / 5111 | 7556560 / 5384 | 3825069 / 2735 | 17030066 | 29741213 |
| lv | 22032356 / 14354 | 11630174 / 7705 | 11611189 / 7705 | 34299878 / 21114 | 46402536 | 79573597 |
| mk | 2591206 / 2049 | 2491435 / 1776 | 2532501 / 1754 | 33517331 / 23863 | 22934774 | 41132473 |
| ml | 1510367 / 1235 | 1118360 / 926 | 1108969 / 873 | 7644241 / 5968 | 6176964 | 11381937 |
| mn | 3184227 / 2193 | 2950138 / 1932 | 3024472 / 1933 | 83640126 / 59357 | 51482622 | 92798963 |
| mr | 3643591 / 2189 | 3067611 / 1766 | 3082336 / 1796 | 4896263 / 2796 | 8257336 | 14689801 |
| mt | 2390444 / 1910 | 2082493 / 1625 | 2279537 / 1660 | 8289266 / 6288 | 8679859 | 15041740 |
| myv | 1926707 / 1241 | 373014 / 239 | 761448 / 481 | | 1899636 | 3061169 |
| nb-NO | 255522 / 227 | 44283 / 33 | 146887 / 116 | 74705 / 59 | 417787 | 521397 |
| ne-NP | 401280 / 353 | 358889 / 314 | 372285 / 287 | 474354 / 362 | 1107410 | 1606808 |
| nl | 54867973 / 43458 | 15893509 / 12032 | 16119447 / 12033 | 2937523 / 2396 | 50465448 | 89818452 |
| nn-NO | 527165 / 464 | 504554 / 405 | 535528 / 423 | 23761 / 18 | 1058381 | 1591008 |
| oc | 404991 / 304 | 344313 / 267 | 371751 / 274 | 10209903 / 7707 | 6584702 | 11330958 |
| pa-IN | 1165822 / 800 | 512761 / 406 | 767600 / 587 | 1685312 / 1243 | 2464166 | 4131495 |
| pl | 34542240 / 24173 | 13801272 / 9856 | 13632158 / 9856 | 3513600 / 2446 | 37701918 | 65489270 |
| ps | 6368249 / 4611 | 5031704 / 3610 | 5364908 / 3610 | 54739210 / 41323 | 39548397 | 71504071 |
| pt | 26828540 / 22923 | 11950658 / 9640 | 12587688 / 9641 | 34899657 / 27353 | 49488787 | 86266543 |
| ro | 5953689 / 5178 | 4417234 / 3918 | 4673895 / 3930 | 26879674 / 23002 | 23634928 | 41924492 |
| ru | 38559961 / 26654 | 15426522 / 10243 | 15737108 / 10244 | 26314800 / 17594 | 54592228 | 96038391 |
| sk | 9029212 / 7354 | 5868546 / 5007 | 6062947 / 5053 | 482561 / 358 | 12009303 | 21443266 |
| sl | 1575265 / 1469 | 1541092 / 1331 | 1660418 / 1340 | 4208395 / 3409 | 5549160 | 8985170 |
| sq | 3723175 / 2658 | 2187629 / 1645 | 2622407 / 1917 | | 4896809 | 8533211 |
| sr | 2115757 / 2336 | 1805602 / 1908 | 2036108 / 1977 | 5126546 / 4846 | 5997628 | 11084013 |
| sv-SE | 9462303 / 8150 | 6263828 / 5420 | 6872199 / 5441 | 7924219 / 6250 | 17495947 | 30522549 |
| sw | 67157638 / 46534 | 18174219 / 12253 | 18079470 / 12256 | 525610635 / 376852 | 352593276 | 629021962 |
| ta | 82949308 / 46390 | 19716888 / 12150 | 19586389 / 12237 | 181142817 / 105179 | 163410751 | 303395402 |
| te | 87767 / 69 | 83241 / 67 | 84150 / 66 | 2471552 / 2034 | 1568346 | 2726710 |
| tg | 171905 / 123 | 111107 / 90 | 90132 / 69 | | 315634 | 373144 |
| th | 40081786 / 32959 | 14825192 / 11057 | 15258892 / 11057 | 252691132 / 208030 | 175053949 | 322857002 |
| tr | 45396387 / 40377 | 12752317 / 11783 | 14250458 / 11784 | 145158 / 116 | 40540706 | 72544320 |
| uk | 33002305 / 26773 | 13613732 / 10253 | 14140846 / 10259 | 10306630 / 8286 | 40071670 | 71063513 |
| ur | 8800149 / 7326 | 6399106 / 5082 | 6549143 / 5082 | 215492072 / 173382 | 131213158 | 237240470 |
| uz | 56988342 / 48733 | 15691563 / 12261 | 17432166 / 12365 | 147306240 / 128457 | 132344997 | 237418311 |
| vi | 2635793 / 2104 | 976402 / 931 | 1511972 / 1367 | 14063781 / 12568 | 10529841 | 19187948 |
| yo | 2246430 / 1404 | 1320273 / 913 | 1803934 / 1113 | 1919407 / 1156 | 4315498 | 7290044 |
| zgh | 838586 / 842 | 293213 / 297 | 269380 / 228 | 720789 / 648 | 1247856 | 2121968 |
| zu | 16797 / 12 | | 1685 / 1 | 107694 / 106 | 112535 | 126176 |

configs (data_files, split: path):
- af: train: af/train-*, validation: af/validation-*, test: af/test-*, other: af/other-*
- am: train: am/train-*, validation: am/validation-*, test: am/test-*, other: am/other-*
- ar: train: ar/train-*, validation: ar/validation-*, test: ar/test-*, other: ar/other-*
- as: train: as/train-*, validation: as/validation-*, test: as/test-*, other: as/other-*
- az: train: az/train-*, validation: az/validation-*, test: az/test-*, other: az/other-*
- be: train: be/train-*, validation: be/validation-*, test: be/test-*, other: be/other-*
- bg: train: bg/train-*, validation: bg/validation-*, test: bg/test-*, other: bg/other-*
- bn: train: bn/train-*, validation: bn/validation-*, test: bn/test-*, other: bn/other-*
- ca: train: ca/train-*, validation: ca/validation-*
test path: ca/test-* - split: other path: ca/other-* - config_name: cs data_files: - split: train path: cs/train-* - split: validation path: cs/validation-* - split: test path: cs/test-* - split: other path: cs/other-* - config_name: cy data_files: - split: train path: cy/train-* - split: validation path: cy/validation-* - split: test path: cy/test-* - split: other path: cy/other-* - config_name: da data_files: - split: train path: da/train-* - split: validation path: da/validation-* - split: test path: da/test-* - split: other path: da/other-* - config_name: el data_files: - split: train path: el/train-* - split: validation path: el/validation-* - split: test path: el/test-* - split: other path: el/other-* - config_name: en data_files: - split: train path: en/train-* - split: validation path: en/validation-* - split: test path: en/test-* - split: other path: en/other-* - config_name: es data_files: - split: train path: es/train-* - split: validation path: es/validation-* - split: test path: es/test-* - split: other path: es/other-* - config_name: et data_files: - split: train path: et/train-* - split: validation path: et/validation-* - split: test path: et/test-* - split: other path: et/other-* - config_name: eu data_files: - split: train path: eu/train-* - split: validation path: eu/validation-* - split: test path: eu/test-* - split: other path: eu/other-* - config_name: fa data_files: - split: train path: fa/train-* - split: validation path: fa/validation-* - split: test path: fa/test-* - split: other path: fa/other-* - config_name: fi data_files: - split: train path: fi/train-* - split: validation path: fi/validation-* - split: test path: fi/test-* - split: other path: fi/other-* - config_name: fr data_files: - split: train path: fr/train-* - split: validation path: fr/validation-* - split: test path: fr/test-* - split: other path: fr/other-* - config_name: gl data_files: - split: train path: gl/train-* - split: validation path: gl/validation-* - split: test 
path: gl/test-* - split: other path: gl/other-* - config_name: ha data_files: - split: train path: ha/train-* - split: validation path: ha/validation-* - split: test path: ha/test-* - split: other path: ha/other-* - config_name: he data_files: - split: train path: he/train-* - split: validation path: he/validation-* - split: test path: he/test-* - split: other path: he/other-* - config_name: hi data_files: - split: train path: hi/train-* - split: validation path: hi/validation-* - split: test path: hi/test-* - split: other path: hi/other-* - config_name: hu data_files: - split: train path: hu/train-* - split: validation path: hu/validation-* - split: test path: hu/test-* - split: other path: hu/other-* - config_name: hy-AM data_files: - split: train path: hy-AM/train-* - split: validation path: hy-AM/validation-* - split: test path: hy-AM/test-* - split: other path: hy-AM/other-* - config_name: ig data_files: - split: train path: ig/train-* - split: validation path: ig/validation-* - split: test path: ig/test-* - split: other path: ig/other-* - config_name: is data_files: - split: train path: is/train-* - split: validation path: is/validation-* - split: test path: is/test-* - split: other path: is/other-* - config_name: it data_files: - split: train path: it/train-* - split: validation path: it/validation-* - split: test path: it/test-* - split: other path: it/other-* - config_name: ja data_files: - split: train path: ja/train-* - split: validation path: ja/validation-* - split: test path: ja/test-* - split: other path: ja/other-* - config_name: ka data_files: - split: train path: ka/train-* - split: validation path: ka/validation-* - split: test path: ka/test-* - split: other path: ka/other-* - config_name: kk data_files: - split: train path: kk/train-* - split: validation path: kk/validation-* - split: test path: kk/test-* - split: other path: kk/other-* - config_name: kmr data_files: - split: train path: kmr/train-* - split: validation path: kmr/validation-* - 
split: test path: kmr/test-* - split: other path: kmr/other-* - config_name: ko data_files: - split: train path: ko/train-* - split: validation path: ko/validation-* - split: test path: ko/test-* - split: other path: ko/other-* - config_name: lo data_files: - split: train path: lo/train-* - split: validation path: lo/validation-* - split: test path: lo/test-* - split: other path: lo/other-* - config_name: lt data_files: - split: train path: lt/train-* - split: validation path: lt/validation-* - split: test path: lt/test-* - split: other path: lt/other-* - config_name: lv data_files: - split: train path: lv/train-* - split: validation path: lv/validation-* - split: test path: lv/test-* - split: other path: lv/other-* - config_name: mk data_files: - split: train path: mk/train-* - split: validation path: mk/validation-* - split: test path: mk/test-* - split: other path: mk/other-* - config_name: ml data_files: - split: train path: ml/train-* - split: validation path: ml/validation-* - split: test path: ml/test-* - split: other path: ml/other-* - config_name: mn data_files: - split: train path: mn/train-* - split: validation path: mn/validation-* - split: test path: mn/test-* - split: other path: mn/other-* - config_name: mr data_files: - split: train path: mr/train-* - split: validation path: mr/validation-* - split: test path: mr/test-* - split: other path: mr/other-* - config_name: mt data_files: - split: train path: mt/train-* - split: validation path: mt/validation-* - split: test path: mt/test-* - split: other path: mt/other-* - config_name: myv data_files: - split: train path: myv/train-* - split: validation path: myv/validation-* - split: test path: myv/test-* - config_name: nb-NO data_files: - split: train path: nb-NO/train-* - split: validation path: nb-NO/validation-* - split: test path: nb-NO/test-* - split: other path: nb-NO/other-* - config_name: ne-NP data_files: - split: train path: ne-NP/train-* - split: validation path: ne-NP/validation-* - split: 
test path: ne-NP/test-* - split: other path: ne-NP/other-* - config_name: nl data_files: - split: train path: nl/train-* - split: validation path: nl/validation-* - split: test path: nl/test-* - split: other path: nl/other-* - config_name: nn-NO data_files: - split: train path: nn-NO/train-* - split: validation path: nn-NO/validation-* - split: test path: nn-NO/test-* - split: other path: nn-NO/other-* - config_name: oc data_files: - split: train path: oc/train-* - split: validation path: oc/validation-* - split: test path: oc/test-* - split: other path: oc/other-* - config_name: pa-IN data_files: - split: train path: pa-IN/train-* - split: validation path: pa-IN/validation-* - split: test path: pa-IN/test-* - split: other path: pa-IN/other-* - config_name: pl data_files: - split: train path: pl/train-* - split: validation path: pl/validation-* - split: test path: pl/test-* - split: other path: pl/other-* - config_name: ps data_files: - split: train path: ps/train-* - split: validation path: ps/validation-* - split: test path: ps/test-* - split: other path: ps/other-* - config_name: pt data_files: - split: train path: pt/train-* - split: validation path: pt/validation-* - split: test path: pt/test-* - split: other path: pt/other-* - config_name: ro data_files: - split: train path: ro/train-* - split: validation path: ro/validation-* - split: test path: ro/test-* - split: other path: ro/other-* - config_name: ru data_files: - split: train path: ru/train-* - split: validation path: ru/validation-* - split: test path: ru/test-* - split: other path: ru/other-* - config_name: sk data_files: - split: train path: sk/train-* - split: validation path: sk/validation-* - split: test path: sk/test-* - split: other path: sk/other-* - config_name: sl data_files: - split: train path: sl/train-* - split: validation path: sl/validation-* - split: test path: sl/test-* - split: other path: sl/other-* - config_name: sq data_files: - split: train path: sq/train-* - split: validation 
path: sq/validation-* - split: test path: sq/test-* - config_name: sr data_files: - split: train path: sr/train-* - split: validation path: sr/validation-* - split: test path: sr/test-* - split: other path: sr/other-* - config_name: sv-SE data_files: - split: train path: sv-SE/train-* - split: validation path: sv-SE/validation-* - split: test path: sv-SE/test-* - split: other path: sv-SE/other-* - config_name: sw data_files: - split: train path: sw/train-* - split: validation path: sw/validation-* - split: test path: sw/test-* - split: other path: sw/other-* - config_name: ta data_files: - split: train path: ta/train-* - split: validation path: ta/validation-* - split: test path: ta/test-* - split: other path: ta/other-* - config_name: te data_files: - split: train path: te/train-* - split: validation path: te/validation-* - split: test path: te/test-* - split: other path: te/other-* - config_name: tg data_files: - split: train path: tg/train-* - split: validation path: tg/validation-* - split: test path: tg/test-* - config_name: th data_files: - split: train path: th/train-* - split: validation path: th/validation-* - split: test path: th/test-* - split: other path: th/other-* - config_name: tr data_files: - split: train path: tr/train-* - split: validation path: tr/validation-* - split: test path: tr/test-* - split: other path: tr/other-* - config_name: uk data_files: - split: train path: uk/train-* - split: validation path: uk/validation-* - split: test path: uk/test-* - split: other path: uk/other-* - config_name: ur data_files: - split: train path: ur/train-* - split: validation path: ur/validation-* - split: test path: ur/test-* - split: other path: ur/other-* - config_name: uz data_files: - split: train path: uz/train-* - split: validation path: uz/validation-* - split: test path: uz/test-* - split: other path: uz/other-* - config_name: vi data_files: - split: train path: vi/train-* - split: validation path: vi/validation-* - split: test path: vi/test-* - 
split: other path: vi/other-* - config_name: yo data_files: - split: train path: yo/train-* - split: validation path: yo/validation-* - split: test path: yo/test-* - split: other path: yo/other-* - config_name: zgh data_files: - split: train path: zgh/train-* - split: validation path: zgh/validation-* - split: test path: zgh/test-* - split: other path: zgh/other-* - config_name: zu data_files: - split: train path: zu/train-* - split: test path: zu/test-* - split: other path: zu/other-* ---

## Dataset Statistics

The following table summarizes the number of examples for each `config_name`, the corresponding language, and each split.

| config_name | language | train_examples | validation_examples | test_examples | other_examples |
|---|---|---:|---:|---:|---:|
| af | Afrikaans | 139 | 125 | 131 | 306 |
| am | Amharic | 523 | 248 | 252 | 579 |
| ar | Arabic | 28,531 | 10,503 | 10,500 | 41,364 |
| as | Assamese | 952 | 485 | 379 | 2,557 |
| az | Azerbaijani | 157 | 78 | 95 | 529 |
| be | Belarusian | 347,672 | 15,879 | 15,880 | 17,002 |
| bg | Bulgarian | 4,952 | 2,932 | 3,354 | 1,787 |
| bn | Bengali | 21,514 | 9,382 | 9,382 | 999,246 |
| ca | Catalan | 1,208,213 | 16,414 | 16,414 | 223,303 |
| cs | Czech | 21,731 | 9,410 | 9,421 | 149,113 |
| cy | Welsh | 8,014 | 5,408 | 5,408 | 20,676 |
| da | Danish | 3,602 | 2,630 | 2,758 | 2,215 |
| el | Greek | 1,934 | 1,694 | 1,711 | 10,351 |
| en | English | 1,138,760 | 16,400 | 16,400 | 370,671 |
| es | Spanish | 353,701 | 15,893 | 15,893 | 1,142,320 |
| et | Estonian | 3,402 | 2,823 | 2,823 | 107 |
| eu | Basque | 130,043 | 14,753 | 14,753 | 115,423 |
| fa | Persian | 29,789 | 10,676 | 10,676 | 34,503 |
| fi | Finnish | 2,093 | 1,767 | 1,806 | 5,078 |
| fr | French | 593,066 | 16,186 | 16,186 | 18,829 |
| gl | Galician | 70,039 | 13,443 | 13,443 | 153,838 |
| ha | Hausa | 1,908 | 623 | 750 | 6,668 |
| he | Hebrew | 1,011 | 672 | 392 | 2,472 |
| hi | Hindi | 4,869 | 2,700 | 3,343 | 4,449 |
| hu | Hungarian | 39,270 | 11,604 | 11,659 | 50,475 |
| hy-AM | Armenian | 9,303 | 5,859 | 5,823 | 15,157 |
| ig | Igbo | 9 | 3 | 5 | 5,784 |
| is | Icelandic | 17 | 9 | 9 | 25 |
| it | Italian | 172,828 | 15,179 | 15,177 | 17,384 |
| ja | Japanese | 15,425 | 8,004 | 8,004 | 263,563 |
| ka | Georgian | 62,537 | 12,952 | 13,104 | 97,022 |
| kk | Kazakh | 605 | 513 | 536 | 730 |
| kmr | Kurdish (Kurmanji) | 5,277 | 3,999 | 3,991 | 25,416 |
| ko | Korean | 519 | 474 | 472 | 3,813 |
| lo | Lao | 98 | 28 | 26 | 61 |
| lt | Lithuanian | 8,299 | 5,111 | 5,384 | 2,735 |
| lv | Latvian | 14,354 | 7,705 | 7,705 | 21,114 |
| mk | Macedonian | 2,049 | 1,776 | 1,754 | 23,863 |
| ml | Malayalam | 1,235 | 926 | 873 | 5,968 |
| mn | Mongolian | 2,193 | 1,932 | 1,933 | 59,357 |
| mr | Marathi | 2,189 | 1,766 | 1,796 | 2,796 |
| mt | Maltese | 1,910 | 1,625 | 1,660 | 6,288 |
| myv | Erzya | 1,241 | 239 | 481 | — |
| nb-NO | Norwegian Bokmål | 227 | 33 | 116 | 59 |
| ne-NP | Nepali | 353 | 314 | 287 | 362 |
| nl | Dutch | 43,458 | 12,032 | 12,033 | 2,396 |
| nn-NO | Norwegian Nynorsk | 464 | 405 | 423 | 18 |
| oc | Occitan | 304 | 267 | 274 | 7,707 |
| pa-IN | Punjabi | 800 | 406 | 587 | 1,243 |
| pl | Polish | 24,173 | 9,856 | 9,856 | 2,446 |
| ps | Pashto | 4,611 | 3,610 | 3,610 | 41,323 |
| pt | Portuguese | 22,923 | 9,640 | 9,641 | 27,353 |
| ro | Romanian | 5,178 | 3,918 | 3,930 | 23,002 |
| ru | Russian | 26,654 | 10,243 | 10,244 | 17,594 |
| sk | Slovak | 7,354 | 5,007 | 5,053 | 358 |
| sl | Slovenian | 1,469 | 1,331 | 1,340 | 3,409 |
| sq | Albanian | 2,658 | 1,645 | 1,917 | — |
| sr | Serbian | 2,336 | 1,908 | 1,977 | 4,846 |
| sv-SE | Swedish | 8,150 | 5,420 | 5,441 | 6,250 |
| sw | Swahili | 46,534 | 12,253 | 12,256 | 376,852 |
| ta | Tamil | 46,390 | 12,150 | 12,237 | 105,179 |
| te | Telugu | 69 | 67 | 66 | 2,034 |
| tg | Tajik | 123 | 90 | 69 | — |
| th | Thai | 32,959 | 11,057 | 11,057 | 208,030 |
| tr | Turkish | 40,377 | 11,783 | 11,784 | 116 |
| uk | Ukrainian | 26,773 | 10,253 | 10,259 | 8,286 |
| ur | Urdu | 7,326 | 5,082 | 5,082 | 173,382 |
| uz | Uzbek | 48,733 | 12,261 | 12,365 | 128,457 |
| vi | Vietnamese | 2,104 | 931 | 1,367 | 12,568 |
| yo | Yoruba | 1,404 | 913 | 1,113 | 1,156 |
| zgh | Standard Moroccan Tamazight | 842 | 297 | 228 | 648 |
| zu | Zulu | 12 | — | 1 | 106 |

> `—` indicates that the split is not available for that language configuration.
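Because a few configurations lack a split, it can help to check availability before requesting one. The following is a minimal sketch, assuming the repository id `deepdml/cv22-neucodec` from the download link and the Hugging Face `datasets` library; the helper names `available_splits` and `load_split` are hypothetical, and the set of missing splits is copied by hand from the table above.

```python
# Splits marked as unavailable in the statistics table (copied by hand).
MISSING_SPLITS = {
    "myv": {"other"},
    "sq": {"other"},
    "tg": {"other"},
    "zu": {"validation"},
}
ALL_SPLITS = ("train", "validation", "test", "other")


def available_splits(config_name: str) -> list:
    """Return the splits the table lists for `config_name` (hypothetical helper)."""
    missing = MISSING_SPLITS.get(config_name, set())
    return [split for split in ALL_SPLITS if split not in missing]


def load_split(config_name: str, split: str, streaming: bool = True):
    """Load one split of one language configuration (hypothetical helper)."""
    if split not in available_splits(config_name):
        raise ValueError(f"config {config_name!r} has no {split!r} split")
    # Imported lazily so the availability check works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "deepdml/cv22-neucodec", config_name, split=split, streaming=streaming
    )
```

For instance, `load_split("zu", "train")` would request the 12 Zulu training examples, while `load_split("zu", "validation")` raises a `ValueError` before any network access. Streaming is the default in this sketch only because some configurations (such as `ca` or `en`) hold more than a million examples.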

All language configurations declare the same five features as the `af` configuration: `audio_path` (string), `duration` (float32), `codes` (sequence of int32), `sentence` (string), and `client_id` (string). Per-split example counts for each configuration are listed in the Dataset Statistics table.
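As an illustration of how the `duration` and `codes` features relate, the sketch below computes the average number of codec tokens per unit of duration for one record. It assumes `duration` is expressed in seconds, which the metadata does not state; the helper name `codes_per_second` and the sample record are made up for this example and do not come from the dataset.

```python
def codes_per_second(example: dict) -> float:
    """Average number of entries in `codes` per second of `duration`."""
    duration = float(example["duration"])
    if duration <= 0:
        raise ValueError("duration must be positive")
    return len(example["codes"]) / duration


# Synthetic record shaped like the declared features; all values are invented.
sample = {
    "audio_path": "clip_0001.mp3",
    "duration": 3.0,
    "codes": [101, 7, 4093, 55] * 30,  # 120 made-up token ids
    "sentence": "example sentence",
    "client_id": "abc123",
}

print(codes_per_second(sample))  # 40.0
```

Running the same function over real records would show the token rate actually present in the data; the 40.0 here is a property of the invented sample only.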
Provider: deepdml