
hotchpotch/nllb-english-bitext-hq

Source: Hugging Face. Updated: 2026-02-09. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/hotchpotch/nllb-english-bitext-hq
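
As a minimal sketch of how one language config from this dataset might be loaded and filtered, the snippet below uses the `datasets` library's `load_dataset` function. The repository ID is taken from the URL above, and the field names (`english`, `translated`, `reranker_score`) come from the dataset metadata on this page. The helper names `load_config` and `filter_pairs`, and the idea that a higher `reranker_score` indicates a better pair, are assumptions made for illustration; the page does not document the score ranges.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple

# Repository ID taken from the download URL on this page.
REPO_ID = "hotchpotch/nllb-english-bitext-hq"


def load_config(config_name: str, streaming: bool = True):
    """Load the train split of one config, e.g. "afr_Latn".

    Hypothetical helper. Requires `pip install datasets` and network access.
    To go through the mirror, setting the HF_ENDPOINT environment variable to
    https://hf-mirror.com before importing is the usual approach (assumption).
    """
    from datasets import load_dataset  # imported lazily so the filter works offline

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


def filter_pairs(
    rows: Iterable[Mapping[str, Any]],
    min_reranker_score: float,
) -> Iterator[Tuple[str, str]]:
    """Yield (english, translated) pairs whose reranker_score meets the threshold.

    Assumes higher reranker_score means a better sentence pair; the threshold
    itself is a hypothetical parameter, not a value recommended by the source.
    """
    for row in rows:
        if row["reranker_score"] >= min_reranker_score:
            yield row["english"], row["translated"]
```

A call such as `filter_pairs(load_config("afr_Latn"), 0.5)` would then stream sentence pairs lazily; the `0.5` cutoff is only a placeholder and should be chosen after inspecting the actual score distribution.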
Resource description:
--- dataset_info: - config_name: afr_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 477233974 num_examples: 2506253 download_size: 290735420 dataset_size: 477233974 - config_name: als_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1210602393 num_examples: 4308029 download_size: 765764482 dataset_size: 1210602393 - config_name: arb_Arab features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2768457648 num_examples: 7823346 download_size: 1584214889 dataset_size: 2768457648 - config_name: ast_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 55333464 num_examples: 262313 download_size: 34361201 dataset_size: 55333464 - config_name: azj_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 12590640 num_examples: 77813 download_size: 7082764 dataset_size: 12590640 - config_name: bel_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score 
dtype: float64 splits: - name: train num_bytes: 42326374 num_examples: 241685 download_size: 23341136 dataset_size: 42326374 - config_name: ben_Beng features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 670551040 num_examples: 2107123 download_size: 325068908 dataset_size: 670551040 - config_name: bre_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 569188 num_examples: 4528 download_size: 265117 dataset_size: 569188 - config_name: bul_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1267898451 num_examples: 3972297 download_size: 699972419 dataset_size: 1267898451 - config_name: cat_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 791190425 num_examples: 3338666 download_size: 496493891 dataset_size: 791190425 - config_name: ceb_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 937574 num_examples: 6552 download_size: 438761 dataset_size: 937574 - config_name: ces_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - 
name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2135352536 num_examples: 7464670 download_size: 1399675417 dataset_size: 2135352536 - config_name: dan_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2192292058 num_examples: 7721310 download_size: 1394265758 dataset_size: 2192292058 - config_name: deu_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2242544225 num_examples: 8820954 download_size: 1402305956 dataset_size: 2242544225 - config_name: ell_Grek features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2986465462 num_examples: 7171541 download_size: 1672115722 dataset_size: 2986465462 - config_name: epo_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1187167763 num_examples: 4119058 download_size: 790952966 dataset_size: 1187167763 - config_name: est_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 697640815 num_examples: 3185349 download_size: 437145330 dataset_size: 697640815 - 
config_name: eus_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 189097266 num_examples: 1091280 download_size: 113212880 dataset_size: 189097266 - config_name: fin_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1837298712 num_examples: 6805803 download_size: 1169313710 dataset_size: 1837298712 - config_name: fra_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2249351614 num_examples: 8779308 download_size: 1384434067 dataset_size: 2249351614 - config_name: fry_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 23262633 num_examples: 169720 download_size: 13458061 dataset_size: 23262633 - config_name: gla_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 135913 num_examples: 1377 download_size: 68049 dataset_size: 135913 - config_name: gle_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - 
name: train num_bytes: 231408603 num_examples: 1143612 download_size: 145652518 dataset_size: 231408603 - config_name: glg_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 491628308 num_examples: 2593749 download_size: 305335704 dataset_size: 491628308 - config_name: guj_Gujr features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 979318339 num_examples: 2853600 download_size: 486507148 dataset_size: 979318339 - config_name: hat_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 47577631 num_examples: 251480 download_size: 31231090 dataset_size: 47577631 - config_name: hau_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 13087999 num_examples: 47709 download_size: 3999205 dataset_size: 13087999 - config_name: heb_Hebr features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 778274594 num_examples: 3413852 download_size: 441801820 dataset_size: 778274594 - config_name: hin_Deva features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: 
original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 791684606 num_examples: 2270876 download_size: 389714841 dataset_size: 791684606 - config_name: hrv_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 724348331 num_examples: 3146308 download_size: 463838503 dataset_size: 724348331 - config_name: hun_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1981517035 num_examples: 6850414 download_size: 1272401117 dataset_size: 1981517035 - config_name: hye_Armn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 327319828 num_examples: 1182510 download_size: 185169670 dataset_size: 327319828 - config_name: ibo_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 899001 num_examples: 6776 download_size: 398267 dataset_size: 899001 - config_name: ilo_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 856298 num_examples: 5850 download_size: 391329 dataset_size: 856298 - config_name: ind_Latn features: - name: english 
dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2170223836 num_examples: 8001918 download_size: 1318792596 dataset_size: 2170223836 - config_name: isl_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 325658994 num_examples: 1985550 download_size: 192271073 dataset_size: 325658994 - config_name: ita_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2308578213 num_examples: 8600910 download_size: 1447684117 dataset_size: 2308578213 - config_name: jav_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 222345070 num_examples: 1105178 download_size: 141293556 dataset_size: 222345070 - config_name: jpn_Jpan features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1143890941 num_examples: 4643143 download_size: 714286139 dataset_size: 1143890941 - config_name: kan_Knda features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1063420152 
num_examples: 2949512 download_size: 511989897 dataset_size: 1063420152 - config_name: kat_Geor features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 282694914 num_examples: 955993 download_size: 136339893 dataset_size: 282694914 - config_name: kaz_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 371866101 num_examples: 1619591 download_size: 206717080 dataset_size: 371866101 - config_name: khk_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 216148331 num_examples: 1016298 download_size: 119945269 dataset_size: 216148331 - config_name: kin_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 24674479 num_examples: 144233 download_size: 16113209 dataset_size: 24674479 - config_name: kir_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 240056146 num_examples: 1104321 download_size: 135094711 dataset_size: 240056146 - config_name: kor_Hang features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: 
int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 947828791 num_examples: 3500753 download_size: 601969723 dataset_size: 947828791 - config_name: lat_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 18604797 num_examples: 139856 download_size: 10722549 dataset_size: 18604797 - config_name: lit_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 767419886 num_examples: 3287855 download_size: 484049410 dataset_size: 767419886 - config_name: ltz_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 92352850 num_examples: 580215 download_size: 60413824 dataset_size: 92352850 - config_name: lvs_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 60662924 num_examples: 274868 download_size: 37055650 dataset_size: 60662924 - config_name: mal_Mlym features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 760287615 num_examples: 2583061 download_size: 356248841 dataset_size: 760287615 - config_name: mar_Deva features: - name: english dtype: string - 
name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 447639396 num_examples: 1935657 download_size: 219334638 dataset_size: 447639396 - config_name: mkd_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 934358063 num_examples: 3120744 download_size: 509341533 dataset_size: 934358063 - config_name: mlt_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 126380928 num_examples: 395657 download_size: 79372258 dataset_size: 126380928 - config_name: mya_Mymr features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 251978942 num_examples: 964409 download_size: 119394198 dataset_size: 251978942 - config_name: nld_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2318041902 num_examples: 8573018 download_size: 1456744386 dataset_size: 2318041902 - config_name: nob_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 767022603 num_examples: 3721296 
download_size: 477746429 dataset_size: 767022603 - config_name: npi_Deva features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 393342106 num_examples: 1582342 download_size: 193645976 dataset_size: 393342106 - config_name: oci_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 31531530 num_examples: 251336 download_size: 17762288 dataset_size: 31531530 - config_name: ory_Orya features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 359123607 num_examples: 1460880 download_size: 174305202 dataset_size: 359123607 - config_name: pan_Guru features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 408233095 num_examples: 1412843 download_size: 204715210 dataset_size: 408233095 - config_name: pap_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 65776192 num_examples: 491934 download_size: 41868980 dataset_size: 65776192 - config_name: pbt_Arab features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: 
float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 281695138 num_examples: 1144240 download_size: 160387676 dataset_size: 281695138 - config_name: pes_Arab features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2066973748 num_examples: 6209954 download_size: 1162742137 dataset_size: 2066973748 - config_name: plt_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 105068057 num_examples: 451126 download_size: 59577038 dataset_size: 105068057 - config_name: pol_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2310544716 num_examples: 8044358 download_size: 1495465655 dataset_size: 2310544716 - config_name: por_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2243948320 num_examples: 8643185 download_size: 1398264719 dataset_size: 2243948320 - config_name: ron_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1307862849 num_examples: 4378083 download_size: 853400687 dataset_size: 1307862849 - config_name: rus_Cyrl features: - name: english dtype: string - name: 
translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 3120481869 num_examples: 8486540 download_size: 1752229133 dataset_size: 3120481869 - config_name: sin_Sinh features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 445159841 num_examples: 1885142 download_size: 226866794 dataset_size: 445159841 - config_name: slk_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 791002686 num_examples: 3559252 download_size: 508050829 dataset_size: 791002686 - config_name: slv_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 701999384 num_examples: 3209953 download_size: 445484696 dataset_size: 701999384 - config_name: snd_Arab features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 50170771 num_examples: 177527 download_size: 22953989 dataset_size: 50170771 - config_name: som_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2288568 num_examples: 8729 download_size: 
758503 dataset_size: 2288568 - config_name: spa_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2134491826 num_examples: 8795040 download_size: 1315072103 dataset_size: 2134491826 - config_name: srp_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 729103397 num_examples: 3193478 download_size: 444593119 dataset_size: 729103397 - config_name: sun_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 154176557 num_examples: 812042 download_size: 98511663 dataset_size: 154176557 - config_name: swe_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2247324775 num_examples: 8250514 download_size: 1419538631 dataset_size: 2247324775 - config_name: swh_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 317196552 num_examples: 1328160 download_size: 197940514 dataset_size: 317196552 - config_name: tam_Taml features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - 
name: laser_score dtype: float64 splits: - name: train num_bytes: 759689430 num_examples: 2536971 download_size: 349800070 dataset_size: 759689430 - config_name: tat_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 87298087 num_examples: 493370 download_size: 49049276 dataset_size: 87298087 - config_name: tel_Telu features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 995445822 num_examples: 2540610 download_size: 479123861 dataset_size: 995445822 - config_name: tgk_Cyrl features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 76532053 num_examples: 347528 download_size: 42880033 dataset_size: 76532053 - config_name: tgl_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 579306357 num_examples: 2624002 download_size: 359367672 dataset_size: 579306357 - config_name: tur_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1729229613 num_examples: 6886531 download_size: 1097257410 dataset_size: 1729229613 - config_name: ukr_Cyrl features: - name: english dtype: string - name: translated dtype: string 
- name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1156172002 num_examples: 3740610 download_size: 656048070 dataset_size: 1156172002 - config_name: urd_Arab features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 601667955 num_examples: 2313544 download_size: 340835734 dataset_size: 601667955 - config_name: uzn_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 307952970 num_examples: 1428827 download_size: 190953047 dataset_size: 307952970 - config_name: vie_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 2225352917 num_examples: 7301273 download_size: 1313321524 dataset_size: 2225352917 - config_name: xho_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 34974 num_examples: 288 download_size: 19033 dataset_size: 34974 - config_name: ydd_Hebr features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 605968 num_examples: 5135 download_size: 257121 dataset_size: 605968 - 
config_name: yor_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 13321604 num_examples: 60034 download_size: 8509688 dataset_size: 13321604 - config_name: zho_Hans features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 1455622905 num_examples: 6386961 download_size: 1000638030 dataset_size: 1455622905 - config_name: zho_Hant features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 111538029 num_examples: 783217 download_size: 72847696 dataset_size: 111538029 - config_name: zsm_Latn features: - name: english dtype: string - name: translated dtype: string - name: reranker_score dtype: float64 - name: original_index_id dtype: int64 - name: score dtype: float64 - name: laser_score dtype: float64 splits: - name: train num_bytes: 925367494 num_examples: 3710177 download_size: 568278718 dataset_size: 925367494 configs: - config_name: afr_Latn data_files: - split: train path: afr_Latn/train-* - config_name: als_Latn data_files: - split: train path: als_Latn/train-* - config_name: arb_Arab data_files: - split: train path: arb_Arab/train-* - config_name: ast_Latn data_files: - split: train path: ast_Latn/train-* - config_name: azj_Latn data_files: - split: train path: azj_Latn/train-* - config_name: bel_Cyrl data_files: - split: train path: bel_Cyrl/train-* - config_name: ben_Beng data_files: - split: train path: ben_Beng/train-* - config_name: bre_Latn data_files: - split: train path: 
bre_Latn/train-* - config_name: bul_Cyrl data_files: - split: train path: bul_Cyrl/train-* - config_name: cat_Latn data_files: - split: train path: cat_Latn/train-* - config_name: ceb_Latn data_files: - split: train path: ceb_Latn/train-* - config_name: ces_Latn data_files: - split: train path: ces_Latn/train-* - config_name: dan_Latn data_files: - split: train path: dan_Latn/train-* - config_name: deu_Latn data_files: - split: train path: deu_Latn/train-* - config_name: ell_Grek data_files: - split: train path: ell_Grek/train-* - config_name: epo_Latn data_files: - split: train path: epo_Latn/train-* - config_name: est_Latn data_files: - split: train path: est_Latn/train-* - config_name: eus_Latn data_files: - split: train path: eus_Latn/train-* - config_name: fin_Latn data_files: - split: train path: fin_Latn/train-* - config_name: fra_Latn data_files: - split: train path: fra_Latn/train-* - config_name: fry_Latn data_files: - split: train path: fry_Latn/train-* - config_name: gla_Latn data_files: - split: train path: gla_Latn/train-* - config_name: gle_Latn data_files: - split: train path: gle_Latn/train-* - config_name: glg_Latn data_files: - split: train path: glg_Latn/train-* - config_name: guj_Gujr data_files: - split: train path: guj_Gujr/train-* - config_name: hat_Latn data_files: - split: train path: hat_Latn/train-* - config_name: hau_Latn data_files: - split: train path: hau_Latn/train-* - config_name: heb_Hebr data_files: - split: train path: heb_Hebr/train-* - config_name: hin_Deva data_files: - split: train path: hin_Deva/train-* - config_name: hrv_Latn data_files: - split: train path: hrv_Latn/train-* - config_name: hun_Latn data_files: - split: train path: hun_Latn/train-* - config_name: hye_Armn data_files: - split: train path: hye_Armn/train-* - config_name: ibo_Latn data_files: - split: train path: ibo_Latn/train-* - config_name: ilo_Latn data_files: - split: train path: ilo_Latn/train-* - config_name: ind_Latn data_files: - split: train path: 
ind_Latn/train-* - config_name: isl_Latn data_files: - split: train path: isl_Latn/train-* - config_name: ita_Latn data_files: - split: train path: ita_Latn/train-* - config_name: jav_Latn data_files: - split: train path: jav_Latn/train-* - config_name: jpn_Jpan data_files: - split: train path: jpn_Jpan/train-* - config_name: kan_Knda data_files: - split: train path: kan_Knda/train-* - config_name: kat_Geor data_files: - split: train path: kat_Geor/train-* - config_name: kaz_Cyrl data_files: - split: train path: kaz_Cyrl/train-* - config_name: khk_Cyrl data_files: - split: train path: khk_Cyrl/train-* - config_name: kin_Latn data_files: - split: train path: kin_Latn/train-* - config_name: kir_Cyrl data_files: - split: train path: kir_Cyrl/train-* - config_name: kor_Hang data_files: - split: train path: kor_Hang/train-* - config_name: lat_Latn data_files: - split: train path: lat_Latn/train-* - config_name: lit_Latn data_files: - split: train path: lit_Latn/train-* - config_name: ltz_Latn data_files: - split: train path: ltz_Latn/train-* - config_name: lvs_Latn data_files: - split: train path: lvs_Latn/train-* - config_name: mal_Mlym data_files: - split: train path: mal_Mlym/train-* - config_name: mar_Deva data_files: - split: train path: mar_Deva/train-* - config_name: mkd_Cyrl data_files: - split: train path: mkd_Cyrl/train-* - config_name: mlt_Latn data_files: - split: train path: mlt_Latn/train-* - config_name: mya_Mymr data_files: - split: train path: mya_Mymr/train-* - config_name: nld_Latn data_files: - split: train path: nld_Latn/train-* - config_name: nob_Latn data_files: - split: train path: nob_Latn/train-* - config_name: npi_Deva data_files: - split: train path: npi_Deva/train-* - config_name: oci_Latn data_files: - split: train path: oci_Latn/train-* - config_name: ory_Orya data_files: - split: train path: ory_Orya/train-* - config_name: pan_Guru data_files: - split: train path: pan_Guru/train-* - config_name: pap_Latn data_files: - split: train path: 
pap_Latn/train-* - config_name: pbt_Arab data_files: - split: train path: pbt_Arab/train-* - config_name: pes_Arab data_files: - split: train path: pes_Arab/train-* - config_name: plt_Latn data_files: - split: train path: plt_Latn/train-* - config_name: pol_Latn data_files: - split: train path: pol_Latn/train-* - config_name: por_Latn data_files: - split: train path: por_Latn/train-* - config_name: ron_Latn data_files: - split: train path: ron_Latn/train-* - config_name: rus_Cyrl data_files: - split: train path: rus_Cyrl/train-* - config_name: sin_Sinh data_files: - split: train path: sin_Sinh/train-* - config_name: slk_Latn data_files: - split: train path: slk_Latn/train-* - config_name: slv_Latn data_files: - split: train path: slv_Latn/train-* - config_name: snd_Arab data_files: - split: train path: snd_Arab/train-* - config_name: som_Latn data_files: - split: train path: som_Latn/train-* - config_name: spa_Latn data_files: - split: train path: spa_Latn/train-* - config_name: srp_Cyrl data_files: - split: train path: srp_Cyrl/train-* - config_name: sun_Latn data_files: - split: train path: sun_Latn/train-* - config_name: swe_Latn data_files: - split: train path: swe_Latn/train-* - config_name: swh_Latn data_files: - split: train path: swh_Latn/train-* - config_name: tam_Taml data_files: - split: train path: tam_Taml/train-* - config_name: tat_Cyrl data_files: - split: train path: tat_Cyrl/train-* - config_name: tel_Telu data_files: - split: train path: tel_Telu/train-* - config_name: tgk_Cyrl data_files: - split: train path: tgk_Cyrl/train-* - config_name: tgl_Latn data_files: - split: train path: tgl_Latn/train-* - config_name: tur_Latn data_files: - split: train path: tur_Latn/train-* - config_name: ukr_Cyrl data_files: - split: train path: ukr_Cyrl/train-* - config_name: urd_Arab data_files: - split: train path: urd_Arab/train-* - config_name: uzn_Latn data_files: - split: train path: uzn_Latn/train-* - config_name: vie_Latn data_files: - split: train path: 
vie_Latn/train-* - config_name: xho_Latn data_files: - split: train path: xho_Latn/train-* - config_name: ydd_Hebr data_files: - split: train path: ydd_Hebr/train-* - config_name: yor_Latn data_files: - split: train path: yor_Latn/train-* - config_name: zho_Hans data_files: - split: train path: zho_Hans/train-* - config_name: zho_Hant data_files: - split: train path: zho_Hant/train-* - config_name: zsm_Latn data_files: - split: train path: zsm_Latn/train-*
---

# nllb-english-bitext-hq

🚧 This dataset is under active development and may change.

This dataset contains filtered English-non-English bitext from NLLB ([allenai/nllb](https://huggingface.co/datasets/allenai/nllb)) and CCMatrix. It is intended for multilingual embedding training, reranker training, and cross-lingual retrieval experiments.

Each row has these fields:

- `english`: English sentence
- `translated`: Non-English sentence
- `reranker_score`: score from `BAAI/bge-reranker-v2-m3`
- `original_index_id`: row index in the source data
- `score`: embedding similarity score from BGE-m3
- `laser_score`: LASER score from source data

⚠️ Important note: This dataset is filtered with BGE-m3 and `BAAI/bge-reranker-v2-m3`. That means model bias can affect what is kept. For languages where BGE-m3 or the reranker is less reliable, useful pairs may be dropped and the remaining distribution may shift.

## Dataset creation process (rough)

For very large source subsets, we first apply random sampling to cap the working set size before scoring and ranking.

All configs are produced with one rough pipeline:

- Text preprocessing
- `BAAI/bge-reranker-v2-m3` scoring and filtering
- BGE-m3 candidate ranking and Top-K sampling (`top-1` or `top-2`; smaller subsets use `top-2`)
- Near-duplicate cleanup (applied when needed)

## Language subsets

| Config | Rows | Source |
| --- | ---: | --- |
| arb_Arab | 7,823,346 | NLLB |
| ben_Beng | 2,107,123 | NLLB |
| ces_Latn | 7,464,670 | CCMatrix |
| dan_Latn | 7,721,310 | CCMatrix |
| deu_Latn | 8,820,954 | NLLB |
| ell_Grek | 7,171,541 | NLLB |
| afr_Latn | 2,506,253 | CCMatrix |
| als_Latn | 4,308,029 | NLLB |
| ast_Latn | 262,313 | CCMatrix |
| azj_Latn | 77,813 | CCMatrix |
| bel_Cyrl | 241,685 | CCMatrix |
| bre_Latn | 4,528 | CCMatrix |
| bul_Cyrl | 3,972,297 | CCMatrix |
| cat_Latn | 3,338,666 | CCMatrix |
| ceb_Latn | 6,552 | CCMatrix |
| epo_Latn | 4,119,058 | NLLB |
| est_Latn | 3,185,349 | CCMatrix |
| eus_Latn | 1,091,280 | CCMatrix |
| fry_Latn | 169,720 | CCMatrix |
| gla_Latn | 1,377 | CCMatrix |
| gle_Latn | 1,143,612 | NLLB |
| glg_Latn | 2,593,749 | CCMatrix |
| guj_Gujr | 2,853,600 | NLLB |
| hat_Latn | 251,480 | NLLB |
| hau_Latn | 47,709 | CCMatrix |
| heb_Hebr | 3,413,852 | CCMatrix |
| hrv_Latn | 3,146,308 | CCMatrix |
| hye_Armn | 1,182,510 | NLLB |
| ibo_Latn | 6,776 | CCMatrix |
| ilo_Latn | 5,850 | CCMatrix |
| isl_Latn | 1,985,550 | CCMatrix |
| jav_Latn | 1,105,178 | NLLB |
| kan_Knda | 2,949,512 | NLLB |
| kat_Geor | 955,993 | NLLB |
| kaz_Cyrl | 1,619,591 | NLLB |
| khk_Cyrl | 1,016,298 | NLLB |
| kin_Latn | 144,233 | NLLB |
| kir_Cyrl | 1,104,321 | NLLB |
| lat_Latn | 139,856 | CCMatrix |
| lit_Latn | 3,287,855 | CCMatrix |
| ltz_Latn | 580,215 | NLLB |
| lvs_Latn | 274,868 | CCMatrix |
| mal_Mlym | 2,583,061 | NLLB |
| mar_Deva | 1,935,657 | NLLB |
| mkd_Cyrl | 3,120,744 | CCMatrix |
| mlt_Latn | 395,657 | NLLB |
| mya_Mymr | 964,409 | NLLB |
| nob_Latn | 3,721,296 | CCMatrix |
| npi_Deva | 1,582,342 | NLLB |
| oci_Latn | 251,336 | CCMatrix |
| ory_Orya | 1,460,880 | NLLB |
| pan_Guru | 1,412,843 | NLLB |
| pap_Latn | 491,934 | NLLB |
| pbt_Arab | 1,144,240 | NLLB |
| plt_Latn | 451,126 | CCMatrix |
| ron_Latn | 4,378,083 | NLLB |
| sin_Sinh | 1,885,142 | NLLB |
| slk_Latn | 3,559,252 | CCMatrix |
| slv_Latn | 3,209,953 | CCMatrix |
| snd_Arab | 177,527 | CCMatrix |
| som_Latn | 8,729 | CCMatrix |
| srp_Cyrl | 3,193,478 | CCMatrix |
| sun_Latn | 812,042 | NLLB |
| tam_Taml | 2,536,971 | NLLB |
| tat_Cyrl | 493,370 | NLLB |
| tgk_Cyrl | 347,528 | NLLB |
| tgl_Latn | 2,624,002 | NLLB |
| ukr_Cyrl | 3,740,610 | CCMatrix |
| urd_Arab | 2,313,544 | NLLB |
| uzn_Latn | 1,428,827 | NLLB |
| xho_Latn | 288 | CCMatrix |
| ydd_Hebr | 5,135 | CCMatrix |
| zho_Hant | 783,217 | NLLB |
| zsm_Latn | 3,710,177 | NLLB |
| fin_Latn | 6,805,803 | CCMatrix |
| fra_Latn | 8,779,308 | NLLB |
| hin_Deva | 2,270,876 | NLLB |
| hun_Latn | 6,850,414 | CCMatrix |
| ind_Latn | 8,001,918 | NLLB |
| ita_Latn | 8,600,910 | NLLB |
| jpn_Jpan | 4,643,143 | CCMatrix |
| kor_Hang | 3,500,753 | CCMatrix |
| nld_Latn | 8,573,018 | NLLB |
| pes_Arab | 6,209,954 | CCMatrix |
| pol_Latn | 8,044,358 | NLLB |
| por_Latn | 8,643,185 | NLLB |
| rus_Cyrl | 8,486,540 | NLLB |
| spa_Latn | 8,795,040 | NLLB |
| swe_Latn | 8,250,514 | NLLB |
| swh_Latn | 1,328,160 | NLLB |
| tel_Telu | 2,540,610 | NLLB |
| tur_Latn | 6,886,531 | CCMatrix |
| vie_Latn | 7,301,273 | CCMatrix |
| yor_Latn | 60,034 | CCMatrix |
| zho_Hans | 6,386,961 | NLLB |

## License

- NLLB-derived subsets follow the NLLB dataset license (ODC-BY): https://huggingface.co/datasets/allenai/nllb
- CCMatrix-derived subsets follow CCMatrix/Common Crawl terms.

Check upstream terms if you need strict license compliance.

## Citation and attribution

- For NLLB-derived subsets, cite the [NLLB paper](https://arxiv.org/abs/2207.04672)
- For CCMatrix-derived subsets, cite the [CCMatrix paper](https://arxiv.org/abs/1911.04944)
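
As a rough illustration of how the per-row fields and the scoring/Top-K steps described in this card fit together, here is a minimal Python sketch. It is not the author's actual pipeline: the threshold `min_score`, the grouping by identical `english` text, and the helper names `filter_by_reranker` and `top_k_per_english` are all assumptions made for illustration, and the toy rows are invented rather than taken from the dataset.

```python
from collections import defaultdict

def filter_by_reranker(rows, min_score):
    """Keep rows whose `reranker_score` is at least `min_score`.

    `min_score` is a hypothetical threshold; the card does not state one.
    """
    return [r for r in rows if r["reranker_score"] >= min_score]

def top_k_per_english(rows, k):
    """Keep the k rows with the highest embedding `score` per English sentence.

    Grouping by identical `english` text is an assumption about how
    candidates are formed; the card only says "Top-K sampling".
    """
    groups = defaultdict(list)
    for r in rows:
        groups[r["english"]].append(r)
    kept = []
    for candidates in groups.values():
        candidates.sort(key=lambda r: r["score"], reverse=True)
        kept.extend(candidates[:k])
    return kept

# Invented toy rows using the field names listed in this card.
rows = [
    {"english": "Good morning.", "translated": "Goeie môre.",
     "reranker_score": 0.98, "score": 0.91},
    {"english": "Good morning.", "translated": "Goeie more almal.",
     "reranker_score": 0.80, "score": 0.74},
    {"english": "Good morning.", "translated": "Goeie nag.",
     "reranker_score": 0.20, "score": 0.55},
    {"english": "Thank you.", "translated": "Dankie.",
     "reranker_score": 0.97, "score": 0.93},
]

filtered = filter_by_reranker(rows, min_score=0.5)  # drops the 0.20 row
selected = top_k_per_english(filtered, k=1)         # one pair per English sentence
for r in selected:
    print(r["english"], "->", r["translated"])
```

The same helpers should work on real rows, for example those obtained with `datasets.load_dataset("hotchpotch/nllb-english-bitext-hq", "afr_Latn", split="train")`, although for the larger configs a streaming or batched approach would likely be more practical than building Python lists.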
Provided by: hotchpotch