
SaylorTwift/mteb-bitext-mining-aggregated

Source: Hugging Face. Updated: 2026-04-02. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SaylorTwift/mteb-bitext-mining-aggregated
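The dataset card metadata on this page declares one config per source dataset (for example BUCC_v2 or Tatoeba) and one split per language pair. A minimal sketch of loading a single split with the Hugging Face `datasets` library might look like the following. The helper names `split_to_codes` and `load_pair` are hypothetical, and the split-name parsing rule is an assumption inferred from names such as fr_en and asm_Beng_ben_Beng, not something the card states.

```python
"""Sketch: load one config/split of the aggregated bitext-mining dataset."""

REPO_ID = "SaylorTwift/mteb-bitext-mining-aggregated"


def split_to_codes(split_name):
    """Split a name such as 'fr_en' or 'asm_Beng_ben_Beng' into two codes.

    Assumption: two-part names are 'code_code' and four-part names are
    'lang_Script_lang_Script'. Anything else (e.g. 'default') returns None.
    """
    parts = split_name.split("_")
    if len(parts) == 2:
        return parts[0], parts[1]
    if len(parts) == 4:
        return "_".join(parts[:2]), "_".join(parts[2:])
    return None


def load_pair(config_name, split_name):
    """Download one split; needs `pip install datasets` and network access."""
    # Imported lazily so the parsing helper above still works offline.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split_name)
```

For instance, `load_pair("BUCC_v2", "fr_en")` would request the split for which the card lists 9086 examples. According to the card, each row should carry the string fields sentence1, sentence2, lang, source_dataset, original_split and config.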
Resource description:
--- language: - multilingual license: apache-2.0 tags: - bitext-mining - sentence-embeddings - mteb - multilingual task_categories: - sentence-similarity pretty_name: MTEB BitextMining Aggregated Dataset (Full) size_categories: - 100K<n<1M configs: - config_name: BUCC_v2 data_files: - split: fr_en path: BUCC_v2/fr_en-* - split: ru_en path: BUCC_v2/ru_en-* - split: de_en path: BUCC_v2/de_en-* - split: zh_en path: BUCC_v2/zh_en-* - config_name: BornholmBitextMining data_files: - split: default path: BornholmBitextMining/default-* - config_name: DiaBlaBitextMining data_files: - split: en_fr path: DiaBlaBitextMining/en_fr-* - split: fr_en path: DiaBlaBitextMining/fr_en-* - config_name: IN22GenBitextMining data_files: - split: asm_Beng_ben_Beng path: IN22GenBitextMining/asm_Beng_ben_Beng-* - split: asm_Beng_brx_Deva path: IN22GenBitextMining/asm_Beng_brx_Deva-* - split: asm_Beng_doi_Deva path: IN22GenBitextMining/asm_Beng_doi_Deva-* - split: asm_Beng_eng_Latn path: IN22GenBitextMining/asm_Beng_eng_Latn-* - split: asm_Beng_gom_Deva path: IN22GenBitextMining/asm_Beng_gom_Deva-* - split: asm_Beng_guj_Gujr path: IN22GenBitextMining/asm_Beng_guj_Gujr-* - split: asm_Beng_hin_Deva path: IN22GenBitextMining/asm_Beng_hin_Deva-* - split: asm_Beng_kan_Knda path: IN22GenBitextMining/asm_Beng_kan_Knda-* - split: asm_Beng_kas_Arab path: IN22GenBitextMining/asm_Beng_kas_Arab-* - split: asm_Beng_mai_Deva path: IN22GenBitextMining/asm_Beng_mai_Deva-* - split: asm_Beng_mal_Mlym path: IN22GenBitextMining/asm_Beng_mal_Mlym-* - split: asm_Beng_mar_Deva path: IN22GenBitextMining/asm_Beng_mar_Deva-* - split: asm_Beng_mni_Mtei path: IN22GenBitextMining/asm_Beng_mni_Mtei-* - split: asm_Beng_npi_Deva path: IN22GenBitextMining/asm_Beng_npi_Deva-* - split: asm_Beng_ory_Orya path: IN22GenBitextMining/asm_Beng_ory_Orya-* - split: asm_Beng_pan_Guru path: IN22GenBitextMining/asm_Beng_pan_Guru-* - split: asm_Beng_san_Deva path: IN22GenBitextMining/asm_Beng_san_Deva-* - split: asm_Beng_sat_Olck path: 
IN22GenBitextMining/asm_Beng_sat_Olck-* - split: asm_Beng_snd_Deva path: IN22GenBitextMining/asm_Beng_snd_Deva-* - split: asm_Beng_tam_Taml path: IN22GenBitextMining/asm_Beng_tam_Taml-* - split: asm_Beng_tel_Telu path: IN22GenBitextMining/asm_Beng_tel_Telu-* - split: asm_Beng_urd_Arab path: IN22GenBitextMining/asm_Beng_urd_Arab-* - split: ben_Beng_asm_Beng path: IN22GenBitextMining/ben_Beng_asm_Beng-* - split: ben_Beng_brx_Deva path: IN22GenBitextMining/ben_Beng_brx_Deva-* - split: ben_Beng_doi_Deva path: IN22GenBitextMining/ben_Beng_doi_Deva-* - split: ben_Beng_eng_Latn path: IN22GenBitextMining/ben_Beng_eng_Latn-* - split: ben_Beng_gom_Deva path: IN22GenBitextMining/ben_Beng_gom_Deva-* - split: ben_Beng_guj_Gujr path: IN22GenBitextMining/ben_Beng_guj_Gujr-* - split: ben_Beng_hin_Deva path: IN22GenBitextMining/ben_Beng_hin_Deva-* - split: ben_Beng_kan_Knda path: IN22GenBitextMining/ben_Beng_kan_Knda-* - split: ben_Beng_kas_Arab path: IN22GenBitextMining/ben_Beng_kas_Arab-* - split: ben_Beng_mai_Deva path: IN22GenBitextMining/ben_Beng_mai_Deva-* - split: ben_Beng_mal_Mlym path: IN22GenBitextMining/ben_Beng_mal_Mlym-* - split: ben_Beng_mar_Deva path: IN22GenBitextMining/ben_Beng_mar_Deva-* - split: ben_Beng_mni_Mtei path: IN22GenBitextMining/ben_Beng_mni_Mtei-* - split: ben_Beng_npi_Deva path: IN22GenBitextMining/ben_Beng_npi_Deva-* - split: ben_Beng_ory_Orya path: IN22GenBitextMining/ben_Beng_ory_Orya-* - split: ben_Beng_pan_Guru path: IN22GenBitextMining/ben_Beng_pan_Guru-* - split: ben_Beng_san_Deva path: IN22GenBitextMining/ben_Beng_san_Deva-* - split: ben_Beng_sat_Olck path: IN22GenBitextMining/ben_Beng_sat_Olck-* - split: ben_Beng_snd_Deva path: IN22GenBitextMining/ben_Beng_snd_Deva-* - split: ben_Beng_tam_Taml path: IN22GenBitextMining/ben_Beng_tam_Taml-* - split: ben_Beng_tel_Telu path: IN22GenBitextMining/ben_Beng_tel_Telu-* - split: ben_Beng_urd_Arab path: IN22GenBitextMining/ben_Beng_urd_Arab-* - split: brx_Deva_asm_Beng path: 
IN22GenBitextMining/brx_Deva_asm_Beng-* - split: brx_Deva_ben_Beng path: IN22GenBitextMining/brx_Deva_ben_Beng-* - split: brx_Deva_doi_Deva path: IN22GenBitextMining/brx_Deva_doi_Deva-* - split: brx_Deva_eng_Latn path: IN22GenBitextMining/brx_Deva_eng_Latn-* - split: brx_Deva_gom_Deva path: IN22GenBitextMining/brx_Deva_gom_Deva-* - split: brx_Deva_guj_Gujr path: IN22GenBitextMining/brx_Deva_guj_Gujr-* - split: brx_Deva_hin_Deva path: IN22GenBitextMining/brx_Deva_hin_Deva-* - split: brx_Deva_kan_Knda path: IN22GenBitextMining/brx_Deva_kan_Knda-* - split: brx_Deva_kas_Arab path: IN22GenBitextMining/brx_Deva_kas_Arab-* - split: brx_Deva_mai_Deva path: IN22GenBitextMining/brx_Deva_mai_Deva-* - split: brx_Deva_mal_Mlym path: IN22GenBitextMining/brx_Deva_mal_Mlym-* - split: brx_Deva_mar_Deva path: IN22GenBitextMining/brx_Deva_mar_Deva-* - split: brx_Deva_mni_Mtei path: IN22GenBitextMining/brx_Deva_mni_Mtei-* - split: brx_Deva_npi_Deva path: IN22GenBitextMining/brx_Deva_npi_Deva-* - split: brx_Deva_ory_Orya path: IN22GenBitextMining/brx_Deva_ory_Orya-* - split: brx_Deva_pan_Guru path: IN22GenBitextMining/brx_Deva_pan_Guru-* - split: brx_Deva_san_Deva path: IN22GenBitextMining/brx_Deva_san_Deva-* - split: brx_Deva_sat_Olck path: IN22GenBitextMining/brx_Deva_sat_Olck-* - split: brx_Deva_snd_Deva path: IN22GenBitextMining/brx_Deva_snd_Deva-* - split: brx_Deva_tam_Taml path: IN22GenBitextMining/brx_Deva_tam_Taml-* - split: brx_Deva_tel_Telu path: IN22GenBitextMining/brx_Deva_tel_Telu-* - split: brx_Deva_urd_Arab path: IN22GenBitextMining/brx_Deva_urd_Arab-* - split: doi_Deva_asm_Beng path: IN22GenBitextMining/doi_Deva_asm_Beng-* - split: doi_Deva_ben_Beng path: IN22GenBitextMining/doi_Deva_ben_Beng-* - split: doi_Deva_brx_Deva path: IN22GenBitextMining/doi_Deva_brx_Deva-* - split: doi_Deva_eng_Latn path: IN22GenBitextMining/doi_Deva_eng_Latn-* - split: doi_Deva_gom_Deva path: IN22GenBitextMining/doi_Deva_gom_Deva-* - split: doi_Deva_guj_Gujr path: 
IN22GenBitextMining/doi_Deva_guj_Gujr-* - split: doi_Deva_hin_Deva path: IN22GenBitextMining/doi_Deva_hin_Deva-* - split: doi_Deva_kan_Knda path: IN22GenBitextMining/doi_Deva_kan_Knda-* - split: doi_Deva_kas_Arab path: IN22GenBitextMining/doi_Deva_kas_Arab-* - split: doi_Deva_mai_Deva path: IN22GenBitextMining/doi_Deva_mai_Deva-* - split: doi_Deva_mal_Mlym path: IN22GenBitextMining/doi_Deva_mal_Mlym-* - split: doi_Deva_mar_Deva path: IN22GenBitextMining/doi_Deva_mar_Deva-* - split: doi_Deva_mni_Mtei path: IN22GenBitextMining/doi_Deva_mni_Mtei-* - split: doi_Deva_npi_Deva path: IN22GenBitextMining/doi_Deva_npi_Deva-* - split: doi_Deva_ory_Orya path: IN22GenBitextMining/doi_Deva_ory_Orya-* - split: doi_Deva_pan_Guru path: IN22GenBitextMining/doi_Deva_pan_Guru-* - split: doi_Deva_san_Deva path: IN22GenBitextMining/doi_Deva_san_Deva-* - split: doi_Deva_sat_Olck path: IN22GenBitextMining/doi_Deva_sat_Olck-* - split: doi_Deva_snd_Deva path: IN22GenBitextMining/doi_Deva_snd_Deva-* - split: doi_Deva_tam_Taml path: IN22GenBitextMining/doi_Deva_tam_Taml-* - split: doi_Deva_tel_Telu path: IN22GenBitextMining/doi_Deva_tel_Telu-* - split: doi_Deva_urd_Arab path: IN22GenBitextMining/doi_Deva_urd_Arab-* - split: eng_Latn_asm_Beng path: IN22GenBitextMining/eng_Latn_asm_Beng-* - split: eng_Latn_ben_Beng path: IN22GenBitextMining/eng_Latn_ben_Beng-* - split: eng_Latn_brx_Deva path: IN22GenBitextMining/eng_Latn_brx_Deva-* - split: eng_Latn_doi_Deva path: IN22GenBitextMining/eng_Latn_doi_Deva-* - split: eng_Latn_gom_Deva path: IN22GenBitextMining/eng_Latn_gom_Deva-* - split: eng_Latn_guj_Gujr path: IN22GenBitextMining/eng_Latn_guj_Gujr-* - split: eng_Latn_hin_Deva path: IN22GenBitextMining/eng_Latn_hin_Deva-* - split: eng_Latn_kan_Knda path: IN22GenBitextMining/eng_Latn_kan_Knda-* - split: eng_Latn_kas_Arab path: IN22GenBitextMining/eng_Latn_kas_Arab-* - split: eng_Latn_mai_Deva path: IN22GenBitextMining/eng_Latn_mai_Deva-* - split: eng_Latn_mal_Mlym path: 
IN22GenBitextMining/eng_Latn_mal_Mlym-* - split: eng_Latn_mar_Deva path: IN22GenBitextMining/eng_Latn_mar_Deva-* - split: eng_Latn_mni_Mtei path: IN22GenBitextMining/eng_Latn_mni_Mtei-* - split: eng_Latn_npi_Deva path: IN22GenBitextMining/eng_Latn_npi_Deva-* - split: eng_Latn_ory_Orya path: IN22GenBitextMining/eng_Latn_ory_Orya-* - split: eng_Latn_pan_Guru path: IN22GenBitextMining/eng_Latn_pan_Guru-* - split: eng_Latn_san_Deva path: IN22GenBitextMining/eng_Latn_san_Deva-* - split: eng_Latn_sat_Olck path: IN22GenBitextMining/eng_Latn_sat_Olck-* - split: eng_Latn_snd_Deva path: IN22GenBitextMining/eng_Latn_snd_Deva-* - split: eng_Latn_tam_Taml path: IN22GenBitextMining/eng_Latn_tam_Taml-* - split: eng_Latn_tel_Telu path: IN22GenBitextMining/eng_Latn_tel_Telu-* - split: eng_Latn_urd_Arab path: IN22GenBitextMining/eng_Latn_urd_Arab-* - split: gom_Deva_asm_Beng path: IN22GenBitextMining/gom_Deva_asm_Beng-* - split: gom_Deva_ben_Beng path: IN22GenBitextMining/gom_Deva_ben_Beng-* - split: gom_Deva_brx_Deva path: IN22GenBitextMining/gom_Deva_brx_Deva-* - split: gom_Deva_doi_Deva path: IN22GenBitextMining/gom_Deva_doi_Deva-* - split: gom_Deva_eng_Latn path: IN22GenBitextMining/gom_Deva_eng_Latn-* - split: gom_Deva_guj_Gujr path: IN22GenBitextMining/gom_Deva_guj_Gujr-* - split: gom_Deva_hin_Deva path: IN22GenBitextMining/gom_Deva_hin_Deva-* - split: gom_Deva_kan_Knda path: IN22GenBitextMining/gom_Deva_kan_Knda-* - split: gom_Deva_kas_Arab path: IN22GenBitextMining/gom_Deva_kas_Arab-* - split: gom_Deva_mai_Deva path: IN22GenBitextMining/gom_Deva_mai_Deva-* - split: gom_Deva_mal_Mlym path: IN22GenBitextMining/gom_Deva_mal_Mlym-* - split: gom_Deva_mar_Deva path: IN22GenBitextMining/gom_Deva_mar_Deva-* - split: gom_Deva_mni_Mtei path: IN22GenBitextMining/gom_Deva_mni_Mtei-* - split: gom_Deva_npi_Deva path: IN22GenBitextMining/gom_Deva_npi_Deva-* - split: gom_Deva_ory_Orya path: IN22GenBitextMining/gom_Deva_ory_Orya-* - split: gom_Deva_pan_Guru path: 
IN22GenBitextMining/gom_Deva_pan_Guru-* - split: gom_Deva_san_Deva path: IN22GenBitextMining/gom_Deva_san_Deva-* - split: gom_Deva_sat_Olck path: IN22GenBitextMining/gom_Deva_sat_Olck-* - config_name: IndicGenBenchFloresBitextMining data_files: - split: asm_eng path: IndicGenBenchFloresBitextMining/asm_eng-* - split: awa_eng path: IndicGenBenchFloresBitextMining/awa_eng-* - split: ben_eng path: IndicGenBenchFloresBitextMining/ben_eng-* - split: bgc_eng path: IndicGenBenchFloresBitextMining/bgc_eng-* - split: bho_eng path: IndicGenBenchFloresBitextMining/bho_eng-* - split: bod_eng path: IndicGenBenchFloresBitextMining/bod_eng-* - split: boy_eng path: IndicGenBenchFloresBitextMining/boy_eng-* - split: eng_asm path: IndicGenBenchFloresBitextMining/eng_asm-* - split: eng_awa path: IndicGenBenchFloresBitextMining/eng_awa-* - split: eng_ben path: IndicGenBenchFloresBitextMining/eng_ben-* - split: eng_bgc path: IndicGenBenchFloresBitextMining/eng_bgc-* - split: eng_bho path: IndicGenBenchFloresBitextMining/eng_bho-* - split: eng_bod path: IndicGenBenchFloresBitextMining/eng_bod-* - split: eng_boy path: IndicGenBenchFloresBitextMining/eng_boy-* - split: eng_gbm path: IndicGenBenchFloresBitextMining/eng_gbm-* - split: eng_gom path: IndicGenBenchFloresBitextMining/eng_gom-* - split: eng_guj path: IndicGenBenchFloresBitextMining/eng_guj-* - split: eng_hin path: IndicGenBenchFloresBitextMining/eng_hin-* - split: eng_hne path: IndicGenBenchFloresBitextMining/eng_hne-* - split: eng_kan path: IndicGenBenchFloresBitextMining/eng_kan-* - split: eng_mai path: IndicGenBenchFloresBitextMining/eng_mai-* - split: eng_mal path: IndicGenBenchFloresBitextMining/eng_mal-* - split: eng_mar path: IndicGenBenchFloresBitextMining/eng_mar-* - split: eng_mni path: IndicGenBenchFloresBitextMining/eng_mni-* - split: eng_mup path: IndicGenBenchFloresBitextMining/eng_mup-* - split: eng_mwr path: IndicGenBenchFloresBitextMining/eng_mwr-* - split: eng_nep path: IndicGenBenchFloresBitextMining/eng_nep-* 
- split: eng_ory path: IndicGenBenchFloresBitextMining/eng_ory-* - split: eng_pan path: IndicGenBenchFloresBitextMining/eng_pan-* - split: eng_pus path: IndicGenBenchFloresBitextMining/eng_pus-* - split: eng_raj path: IndicGenBenchFloresBitextMining/eng_raj-* - split: eng_san path: IndicGenBenchFloresBitextMining/eng_san-* - split: eng_sat path: IndicGenBenchFloresBitextMining/eng_sat-* - split: eng_tam path: IndicGenBenchFloresBitextMining/eng_tam-* - split: eng_tel path: IndicGenBenchFloresBitextMining/eng_tel-* - split: eng_urd path: IndicGenBenchFloresBitextMining/eng_urd-* - split: gbm_eng path: IndicGenBenchFloresBitextMining/gbm_eng-* - split: gom_eng path: IndicGenBenchFloresBitextMining/gom_eng-* - split: guj_eng path: IndicGenBenchFloresBitextMining/guj_eng-* - split: hin_eng path: IndicGenBenchFloresBitextMining/hin_eng-* - split: hne_eng path: IndicGenBenchFloresBitextMining/hne_eng-* - split: kan_eng path: IndicGenBenchFloresBitextMining/kan_eng-* - split: mai_eng path: IndicGenBenchFloresBitextMining/mai_eng-* - split: mal_eng path: IndicGenBenchFloresBitextMining/mal_eng-* - split: mar_eng path: IndicGenBenchFloresBitextMining/mar_eng-* - split: mni_eng path: IndicGenBenchFloresBitextMining/mni_eng-* - split: mup_eng path: IndicGenBenchFloresBitextMining/mup_eng-* - split: mwr_eng path: IndicGenBenchFloresBitextMining/mwr_eng-* - split: nep_eng path: IndicGenBenchFloresBitextMining/nep_eng-* - split: ory_eng path: IndicGenBenchFloresBitextMining/ory_eng-* - split: pan_eng path: IndicGenBenchFloresBitextMining/pan_eng-* - split: pus_eng path: IndicGenBenchFloresBitextMining/pus_eng-* - split: raj_eng path: IndicGenBenchFloresBitextMining/raj_eng-* - split: san_eng path: IndicGenBenchFloresBitextMining/san_eng-* - split: sat_eng path: IndicGenBenchFloresBitextMining/sat_eng-* - split: tam_eng path: IndicGenBenchFloresBitextMining/tam_eng-* - split: tel_eng path: IndicGenBenchFloresBitextMining/tel_eng-* - split: urd_eng path: 
IndicGenBenchFloresBitextMining/urd_eng-* - config_name: NollySentiBitextMining data_files: - split: en_ha path: NollySentiBitextMining/en_ha-* - split: en_ig path: NollySentiBitextMining/en_ig-* - split: en_pcm path: NollySentiBitextMining/en_pcm-* - split: en_yo path: NollySentiBitextMining/en_yo-* - config_name: NorwegianCourtsBitextMining data_files: - split: default path: NorwegianCourtsBitextMining/default-* - config_name: NusaTranslationBitextMining data_files: - split: ind_abs path: NusaTranslationBitextMining/ind_abs-* - split: ind_bew path: NusaTranslationBitextMining/ind_bew-* - split: ind_bhp path: NusaTranslationBitextMining/ind_bhp-* - split: ind_btk path: NusaTranslationBitextMining/ind_btk-* - split: ind_jav path: NusaTranslationBitextMining/ind_jav-* - split: ind_mad path: NusaTranslationBitextMining/ind_mad-* - split: ind_mak path: NusaTranslationBitextMining/ind_mak-* - split: ind_min path: NusaTranslationBitextMining/ind_min-* - split: ind_mui path: NusaTranslationBitextMining/ind_mui-* - split: ind_rej path: NusaTranslationBitextMining/ind_rej-* - split: ind_sun path: NusaTranslationBitextMining/ind_sun-* - config_name: NusaXBitextMining data_files: - split: eng_ace path: NusaXBitextMining/eng_ace-* - split: eng_ban path: NusaXBitextMining/eng_ban-* - split: eng_bbc path: NusaXBitextMining/eng_bbc-* - split: eng_bjn path: NusaXBitextMining/eng_bjn-* - split: eng_bug path: NusaXBitextMining/eng_bug-* - split: eng_ind path: NusaXBitextMining/eng_ind-* - split: eng_jav path: NusaXBitextMining/eng_jav-* - split: eng_mad path: NusaXBitextMining/eng_mad-* - split: eng_min path: NusaXBitextMining/eng_min-* - split: eng_nij path: NusaXBitextMining/eng_nij-* - split: eng_sun path: NusaXBitextMining/eng_sun-* - config_name: Tatoeba data_files: - split: sqi_eng path: Tatoeba/sqi_eng-* - split: fry_eng path: Tatoeba/fry_eng-* - split: kur_eng path: Tatoeba/kur_eng-* - split: tur_eng path: Tatoeba/tur_eng-* - split: deu_eng path: Tatoeba/deu_eng-* - split: 
nld_eng path: Tatoeba/nld_eng-* - split: ron_eng path: Tatoeba/ron_eng-* - split: ang_eng path: Tatoeba/ang_eng-* - split: ido_eng path: Tatoeba/ido_eng-* - split: jav_eng path: Tatoeba/jav_eng-* - split: isl_eng path: Tatoeba/isl_eng-* - split: slv_eng path: Tatoeba/slv_eng-* - split: cym_eng path: Tatoeba/cym_eng-* - split: kaz_eng path: Tatoeba/kaz_eng-* - split: est_eng path: Tatoeba/est_eng-* - split: heb_eng path: Tatoeba/heb_eng-* - split: gla_eng path: Tatoeba/gla_eng-* - split: mar_eng path: Tatoeba/mar_eng-* - split: lat_eng path: Tatoeba/lat_eng-* - split: bel_eng path: Tatoeba/bel_eng-* - split: pms_eng path: Tatoeba/pms_eng-* - split: gle_eng path: Tatoeba/gle_eng-* - split: pes_eng path: Tatoeba/pes_eng-* - split: nob_eng path: Tatoeba/nob_eng-* - split: bul_eng path: Tatoeba/bul_eng-* - split: cbk_eng path: Tatoeba/cbk_eng-* - split: hun_eng path: Tatoeba/hun_eng-* - split: uig_eng path: Tatoeba/uig_eng-* - split: rus_eng path: Tatoeba/rus_eng-* - split: spa_eng path: Tatoeba/spa_eng-* - split: hye_eng path: Tatoeba/hye_eng-* - split: tel_eng path: Tatoeba/tel_eng-* - split: afr_eng path: Tatoeba/afr_eng-* - split: mon_eng path: Tatoeba/mon_eng-* - split: arz_eng path: Tatoeba/arz_eng-* - split: hrv_eng path: Tatoeba/hrv_eng-* - split: nov_eng path: Tatoeba/nov_eng-* - split: gsw_eng path: Tatoeba/gsw_eng-* - split: nds_eng path: Tatoeba/nds_eng-* - split: ukr_eng path: Tatoeba/ukr_eng-* - split: uzb_eng path: Tatoeba/uzb_eng-* - split: lit_eng path: Tatoeba/lit_eng-* - split: ina_eng path: Tatoeba/ina_eng-* - split: lfn_eng path: Tatoeba/lfn_eng-* - split: zsm_eng path: Tatoeba/zsm_eng-* - split: ita_eng path: Tatoeba/ita_eng-* - split: cmn_eng path: Tatoeba/cmn_eng-* - split: lvs_eng path: Tatoeba/lvs_eng-* - split: glg_eng path: Tatoeba/glg_eng-* - split: ceb_eng path: Tatoeba/ceb_eng-* - split: bre_eng path: Tatoeba/bre_eng-* - split: ben_eng path: Tatoeba/ben_eng-* - split: swg_eng path: Tatoeba/swg_eng-* - split: arq_eng path: Tatoeba/arq_eng-* 
- split: kab_eng path: Tatoeba/kab_eng-* - split: fra_eng path: Tatoeba/fra_eng-* - split: por_eng path: Tatoeba/por_eng-* - split: tat_eng path: Tatoeba/tat_eng-* - split: oci_eng path: Tatoeba/oci_eng-* - split: pol_eng path: Tatoeba/pol_eng-* - split: war_eng path: Tatoeba/war_eng-* - split: aze_eng path: Tatoeba/aze_eng-* - split: vie_eng path: Tatoeba/vie_eng-* - split: nno_eng path: Tatoeba/nno_eng-* - split: cha_eng path: Tatoeba/cha_eng-* - split: mhr_eng path: Tatoeba/mhr_eng-* - split: dan_eng path: Tatoeba/dan_eng-* - split: ell_eng path: Tatoeba/ell_eng-* - split: amh_eng path: Tatoeba/amh_eng-* - split: pam_eng path: Tatoeba/pam_eng-* - split: hsb_eng path: Tatoeba/hsb_eng-* - split: srp_eng path: Tatoeba/srp_eng-* - split: epo_eng path: Tatoeba/epo_eng-* - split: kzj_eng path: Tatoeba/kzj_eng-* - split: awa_eng path: Tatoeba/awa_eng-* - split: fao_eng path: Tatoeba/fao_eng-* - split: mal_eng path: Tatoeba/mal_eng-* - split: ile_eng path: Tatoeba/ile_eng-* - split: bos_eng path: Tatoeba/bos_eng-* - split: cor_eng path: Tatoeba/cor_eng-* - split: cat_eng path: Tatoeba/cat_eng-* - split: eus_eng path: Tatoeba/eus_eng-* - split: yue_eng path: Tatoeba/yue_eng-* - split: swe_eng path: Tatoeba/swe_eng-* - split: dtp_eng path: Tatoeba/dtp_eng-* - split: kat_eng path: Tatoeba/kat_eng-* - split: jpn_eng path: Tatoeba/jpn_eng-* - split: csb_eng path: Tatoeba/csb_eng-* - split: xho_eng path: Tatoeba/xho_eng-* - split: orv_eng path: Tatoeba/orv_eng-* - split: ind_eng path: Tatoeba/ind_eng-* - split: tuk_eng path: Tatoeba/tuk_eng-* - split: max_eng path: Tatoeba/max_eng-* - split: swh_eng path: Tatoeba/swh_eng-* - split: hin_eng path: Tatoeba/hin_eng-* - split: dsb_eng path: Tatoeba/dsb_eng-* - split: ber_eng path: Tatoeba/ber_eng-* - split: tam_eng path: Tatoeba/tam_eng-* - split: slk_eng path: Tatoeba/slk_eng-* - split: tgl_eng path: Tatoeba/tgl_eng-* - split: ast_eng path: Tatoeba/ast_eng-* - split: mkd_eng path: Tatoeba/mkd_eng-* - split: khm_eng path: 
Tatoeba/khm_eng-* - split: ces_eng path: Tatoeba/ces_eng-* - split: tzl_eng path: Tatoeba/tzl_eng-* - split: urd_eng path: Tatoeba/urd_eng-* - split: ara_eng path: Tatoeba/ara_eng-* - split: kor_eng path: Tatoeba/kor_eng-* - split: yid_eng path: Tatoeba/yid_eng-* - split: fin_eng path: Tatoeba/fin_eng-* - split: tha_eng path: Tatoeba/tha_eng-* - split: wuu_eng path: Tatoeba/wuu_eng-* - config_name: default data_files: - split: train path: data/train-* dataset_info: - config_name: BUCC_v2 features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: fr_en num_bytes: 2127711 num_examples: 9086 - name: ru_en num_bytes: 4713530 num_examples: 14435 - name: de_en num_bytes: 2373378 num_examples: 9580 - name: zh_en num_bytes: 425398 num_examples: 1899 download_size: 4995323 dataset_size: 9640017 - config_name: BornholmBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: default num_bytes: 905545 num_examples: 6785 download_size: 331753 dataset_size: 905545 - config_name: DiaBlaBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: en_fr num_bytes: 843941 num_examples: 5748 - name: fr_en num_bytes: 843941 num_examples: 5748 download_size: 725286 dataset_size: 1687882 - config_name: IN22GenBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: asm_Beng_ben_Beng num_bytes: 915680 num_examples: 
1024 - name: asm_Beng_brx_Deva num_bytes: 949101 num_examples: 1024 - name: asm_Beng_doi_Deva num_bytes: 921310 num_examples: 1024 - name: asm_Beng_eng_Latn num_bytes: 672370 num_examples: 1024 - name: asm_Beng_gom_Deva num_bytes: 915888 num_examples: 1024 - name: asm_Beng_guj_Gujr num_bytes: 905828 num_examples: 1024 - name: asm_Beng_hin_Deva num_bytes: 924168 num_examples: 1024 - name: asm_Beng_kan_Knda num_bytes: 982934 num_examples: 1024 - name: asm_Beng_kas_Arab num_bytes: 807167 num_examples: 1024 - name: asm_Beng_mai_Deva num_bytes: 904788 num_examples: 1024 - name: asm_Beng_mal_Mlym num_bytes: 1019092 num_examples: 1024 - name: asm_Beng_mar_Deva num_bytes: 945191 num_examples: 1024 - name: asm_Beng_mni_Mtei num_bytes: 920032 num_examples: 1024 - name: asm_Beng_npi_Deva num_bytes: 925943 num_examples: 1024 - name: asm_Beng_ory_Orya num_bytes: 978870 num_examples: 1024 - name: asm_Beng_pan_Guru num_bytes: 883929 num_examples: 1024 - name: asm_Beng_san_Deva num_bytes: 942149 num_examples: 1024 - name: asm_Beng_sat_Olck num_bytes: 955884 num_examples: 1024 - name: asm_Beng_snd_Deva num_bytes: 924307 num_examples: 1024 - name: asm_Beng_tam_Taml num_bytes: 1026680 num_examples: 1024 - name: asm_Beng_tel_Telu num_bytes: 938937 num_examples: 1024 - name: asm_Beng_urd_Arab num_bytes: 788703 num_examples: 1024 - name: ben_Beng_asm_Beng num_bytes: 915680 num_examples: 1024 - name: ben_Beng_brx_Deva num_bytes: 922319 num_examples: 1024 - name: ben_Beng_doi_Deva num_bytes: 894528 num_examples: 1024 - name: ben_Beng_eng_Latn num_bytes: 645588 num_examples: 1024 - name: ben_Beng_gom_Deva num_bytes: 889106 num_examples: 1024 - name: ben_Beng_guj_Gujr num_bytes: 879046 num_examples: 1024 - name: ben_Beng_hin_Deva num_bytes: 897386 num_examples: 1024 - name: ben_Beng_kan_Knda num_bytes: 956152 num_examples: 1024 - name: ben_Beng_kas_Arab num_bytes: 780385 num_examples: 1024 - name: ben_Beng_mai_Deva num_bytes: 878006 num_examples: 1024 - name: ben_Beng_mal_Mlym num_bytes: 
992310 num_examples: 1024 - name: ben_Beng_mar_Deva num_bytes: 918409 num_examples: 1024 - name: ben_Beng_mni_Mtei num_bytes: 893250 num_examples: 1024 - name: ben_Beng_npi_Deva num_bytes: 899161 num_examples: 1024 - name: ben_Beng_ory_Orya num_bytes: 952088 num_examples: 1024 - name: ben_Beng_pan_Guru num_bytes: 857147 num_examples: 1024 - name: ben_Beng_san_Deva num_bytes: 915367 num_examples: 1024 - name: ben_Beng_sat_Olck num_bytes: 929102 num_examples: 1024 - name: ben_Beng_snd_Deva num_bytes: 897525 num_examples: 1024 - name: ben_Beng_tam_Taml num_bytes: 999898 num_examples: 1024 - name: ben_Beng_tel_Telu num_bytes: 912155 num_examples: 1024 - name: ben_Beng_urd_Arab num_bytes: 761921 num_examples: 1024 - name: brx_Deva_asm_Beng num_bytes: 949101 num_examples: 1024 - name: brx_Deva_ben_Beng num_bytes: 922319 num_examples: 1024 - name: brx_Deva_doi_Deva num_bytes: 927949 num_examples: 1024 - name: brx_Deva_eng_Latn num_bytes: 679009 num_examples: 1024 - name: brx_Deva_gom_Deva num_bytes: 922527 num_examples: 1024 - name: brx_Deva_guj_Gujr num_bytes: 912467 num_examples: 1024 - name: brx_Deva_hin_Deva num_bytes: 930807 num_examples: 1024 - name: brx_Deva_kan_Knda num_bytes: 989573 num_examples: 1024 - name: brx_Deva_kas_Arab num_bytes: 813806 num_examples: 1024 - name: brx_Deva_mai_Deva num_bytes: 911427 num_examples: 1024 - name: brx_Deva_mal_Mlym num_bytes: 1025731 num_examples: 1024 - name: brx_Deva_mar_Deva num_bytes: 951830 num_examples: 1024 - name: brx_Deva_mni_Mtei num_bytes: 926671 num_examples: 1024 - name: brx_Deva_npi_Deva num_bytes: 932582 num_examples: 1024 - name: brx_Deva_ory_Orya num_bytes: 985509 num_examples: 1024 - name: brx_Deva_pan_Guru num_bytes: 890568 num_examples: 1024 - name: brx_Deva_san_Deva num_bytes: 948788 num_examples: 1024 - name: brx_Deva_sat_Olck num_bytes: 962523 num_examples: 1024 - name: brx_Deva_snd_Deva num_bytes: 930946 num_examples: 1024 - name: brx_Deva_tam_Taml num_bytes: 1033319 num_examples: 1024 - name: 
brx_Deva_tel_Telu num_bytes: 945576 num_examples: 1024 - name: brx_Deva_urd_Arab num_bytes: 795342 num_examples: 1024 - name: doi_Deva_asm_Beng num_bytes: 921310 num_examples: 1024 - name: doi_Deva_ben_Beng num_bytes: 894528 num_examples: 1024 - name: doi_Deva_brx_Deva num_bytes: 927949 num_examples: 1024 - name: doi_Deva_eng_Latn num_bytes: 651218 num_examples: 1024 - name: doi_Deva_gom_Deva num_bytes: 894736 num_examples: 1024 - name: doi_Deva_guj_Gujr num_bytes: 884676 num_examples: 1024 - name: doi_Deva_hin_Deva num_bytes: 903016 num_examples: 1024 - name: doi_Deva_kan_Knda num_bytes: 961782 num_examples: 1024 - name: doi_Deva_kas_Arab num_bytes: 786015 num_examples: 1024 - name: doi_Deva_mai_Deva num_bytes: 883636 num_examples: 1024 - name: doi_Deva_mal_Mlym num_bytes: 997940 num_examples: 1024 - name: doi_Deva_mar_Deva num_bytes: 924039 num_examples: 1024 - name: doi_Deva_mni_Mtei num_bytes: 898880 num_examples: 1024 - name: doi_Deva_npi_Deva num_bytes: 904791 num_examples: 1024 - name: doi_Deva_ory_Orya num_bytes: 957718 num_examples: 1024 - name: doi_Deva_pan_Guru num_bytes: 862777 num_examples: 1024 - name: doi_Deva_san_Deva num_bytes: 920997 num_examples: 1024 - name: doi_Deva_sat_Olck num_bytes: 934732 num_examples: 1024 - name: doi_Deva_snd_Deva num_bytes: 903155 num_examples: 1024 - name: doi_Deva_tam_Taml num_bytes: 1005528 num_examples: 1024 - name: doi_Deva_tel_Telu num_bytes: 917785 num_examples: 1024 - name: doi_Deva_urd_Arab num_bytes: 767551 num_examples: 1024 - name: eng_Latn_asm_Beng num_bytes: 672370 num_examples: 1024 - name: eng_Latn_ben_Beng num_bytes: 645588 num_examples: 1024 - name: eng_Latn_brx_Deva num_bytes: 679009 num_examples: 1024 - name: eng_Latn_doi_Deva num_bytes: 651218 num_examples: 1024 - name: eng_Latn_gom_Deva num_bytes: 645796 num_examples: 1024 - name: eng_Latn_guj_Gujr num_bytes: 635736 num_examples: 1024 - name: eng_Latn_hin_Deva num_bytes: 654076 num_examples: 1024 - name: eng_Latn_kan_Knda num_bytes: 712842 
num_examples: 1024 - name: eng_Latn_kas_Arab num_bytes: 537075 num_examples: 1024 - name: eng_Latn_mai_Deva num_bytes: 634696 num_examples: 1024 - name: eng_Latn_mal_Mlym num_bytes: 749000 num_examples: 1024 - name: eng_Latn_mar_Deva num_bytes: 675099 num_examples: 1024 - name: eng_Latn_mni_Mtei num_bytes: 649940 num_examples: 1024 - name: eng_Latn_npi_Deva num_bytes: 655851 num_examples: 1024 - name: eng_Latn_ory_Orya num_bytes: 708778 num_examples: 1024 - name: eng_Latn_pan_Guru num_bytes: 613837 num_examples: 1024 - name: eng_Latn_san_Deva num_bytes: 672057 num_examples: 1024 - name: eng_Latn_sat_Olck num_bytes: 685792 num_examples: 1024 - name: eng_Latn_snd_Deva num_bytes: 654215 num_examples: 1024 - name: eng_Latn_tam_Taml num_bytes: 756588 num_examples: 1024 - name: eng_Latn_tel_Telu num_bytes: 668845 num_examples: 1024 - name: eng_Latn_urd_Arab num_bytes: 518611 num_examples: 1024 - name: gom_Deva_asm_Beng num_bytes: 915888 num_examples: 1024 - name: gom_Deva_ben_Beng num_bytes: 889106 num_examples: 1024 - name: gom_Deva_brx_Deva num_bytes: 922527 num_examples: 1024 - name: gom_Deva_doi_Deva num_bytes: 894736 num_examples: 1024 - name: gom_Deva_eng_Latn num_bytes: 645796 num_examples: 1024 - name: gom_Deva_guj_Gujr num_bytes: 879254 num_examples: 1024 - name: gom_Deva_hin_Deva num_bytes: 897594 num_examples: 1024 - name: gom_Deva_kan_Knda num_bytes: 956360 num_examples: 1024 - name: gom_Deva_kas_Arab num_bytes: 780593 num_examples: 1024 - name: gom_Deva_mai_Deva num_bytes: 878214 num_examples: 1024 - name: gom_Deva_mal_Mlym num_bytes: 992518 num_examples: 1024 - name: gom_Deva_mar_Deva num_bytes: 918617 num_examples: 1024 - name: gom_Deva_mni_Mtei num_bytes: 893458 num_examples: 1024 - name: gom_Deva_npi_Deva num_bytes: 899369 num_examples: 1024 - name: gom_Deva_ory_Orya num_bytes: 952296 num_examples: 1024 - name: gom_Deva_pan_Guru num_bytes: 857355 num_examples: 1024 - name: gom_Deva_san_Deva num_bytes: 915575 num_examples: 1024 - name: gom_Deva_sat_Olck 
num_bytes: 929310 num_examples: 1024 download_size: 43694888 dataset_size: 110224194 - config_name: IndicGenBenchFloresBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: asm_eng num_bytes: 1056946 num_examples: 2009 - name: awa_eng num_bytes: 1048599 num_examples: 2009 - name: ben_eng num_bytes: 1074234 num_examples: 2009 - name: bgc_eng num_bytes: 1036830 num_examples: 2009 - name: bho_eng num_bytes: 1040988 num_examples: 2009 - name: bod_eng num_bytes: 1257640 num_examples: 2009 - name: boy_eng num_bytes: 1161403 num_examples: 2009 - name: eng_asm num_bytes: 1056968 num_examples: 2009 - name: eng_awa num_bytes: 1048493 num_examples: 2009 - name: eng_ben num_bytes: 1074234 num_examples: 2009 - name: eng_bgc num_bytes: 1036830 num_examples: 2009 - name: eng_bho num_bytes: 1041035 num_examples: 2009 - name: eng_bod num_bytes: 1257558 num_examples: 2009 - name: eng_boy num_bytes: 1161403 num_examples: 2009 - name: eng_gbm num_bytes: 1050717 num_examples: 2009 - name: eng_gom num_bytes: 1056819 num_examples: 2009 - name: eng_guj num_bytes: 1045771 num_examples: 2009 - name: eng_hin num_bytes: 1059315 num_examples: 2009 - name: eng_hne num_bytes: 1038653 num_examples: 2009 - name: eng_kan num_bytes: 1132860 num_examples: 2009 - name: eng_mai num_bytes: 1055548 num_examples: 2009 - name: eng_mal num_bytes: 1203090 num_examples: 2009 - name: eng_mar num_bytes: 1091179 num_examples: 2009 - name: eng_mni num_bytes: 1117401 num_examples: 2009 - name: eng_mup num_bytes: 1058706 num_examples: 2009 - name: eng_mwr num_bytes: 1062446 num_examples: 2009 - name: eng_nep num_bytes: 1062632 num_examples: 2009 - name: eng_ory num_bytes: 1107245 num_examples: 2009 - name: eng_pan num_bytes: 1070204 num_examples: 2009 - name: eng_pus num_bytes: 828135 num_examples: 2009 - name: eng_raj num_bytes: 1058173 
num_examples: 2009 - name: eng_san num_bytes: 1080078 num_examples: 2009 - name: eng_sat num_bytes: 1123190 num_examples: 2009 - name: eng_tam num_bytes: 1220422 num_examples: 2009 - name: eng_tel num_bytes: 1092493 num_examples: 2009 - name: eng_urd num_bytes: 853498 num_examples: 2009 - name: gbm_eng num_bytes: 1050717 num_examples: 2009 - name: gom_eng num_bytes: 1056819 num_examples: 2009 - name: guj_eng num_bytes: 1045771 num_examples: 2009 - name: hin_eng num_bytes: 1059335 num_examples: 2009 - name: hne_eng num_bytes: 1038672 num_examples: 2009 - name: kan_eng num_bytes: 1133075 num_examples: 2009 - name: mai_eng num_bytes: 1055542 num_examples: 2009 - name: mal_eng num_bytes: 1203089 num_examples: 2009 - name: mar_eng num_bytes: 1091578 num_examples: 2009 - name: mni_eng num_bytes: 1117396 num_examples: 2009 - name: mup_eng num_bytes: 1058706 num_examples: 2009 - name: mwr_eng num_bytes: 1062446 num_examples: 2009 - name: nep_eng num_bytes: 1062632 num_examples: 2009 - name: ory_eng num_bytes: 1107245 num_examples: 2009 - name: pan_eng num_bytes: 1070205 num_examples: 2009 - name: pus_eng num_bytes: 828136 num_examples: 2009 - name: raj_eng num_bytes: 1058173 num_examples: 2009 - name: san_eng num_bytes: 1080358 num_examples: 2009 - name: sat_eng num_bytes: 1123190 num_examples: 2009 - name: tam_eng num_bytes: 1220422 num_examples: 2009 - name: tel_eng num_bytes: 1092511 num_examples: 2009 - name: urd_eng num_bytes: 853499 num_examples: 2009 download_size: 27552478 dataset_size: 62291253 - config_name: NollySentiBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: en_ha num_bytes: 141324 num_examples: 410 - name: en_ig num_bytes: 142407 num_examples: 410 - name: en_pcm num_bytes: 135981 num_examples: 410 - name: en_yo num_bytes: 157241 num_examples: 410 download_size: 310379 
dataset_size: 576953 - config_name: NorwegianCourtsBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: default num_bytes: 265698 num_examples: 1137 download_size: 116989 dataset_size: 265698 - config_name: NusaTranslationBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: ind_abs num_bytes: 365680 num_examples: 1000 - name: ind_bew num_bytes: 2420537 num_examples: 6600 - name: ind_bhp num_bytes: 331696 num_examples: 1000 - name: ind_btk num_bytes: 2389908 num_examples: 6600 - name: ind_jav num_bytes: 2384271 num_examples: 6600 - name: ind_mad num_bytes: 2435301 num_examples: 6600 - name: ind_mak num_bytes: 2423126 num_examples: 6600 - name: ind_min num_bytes: 2399033 num_examples: 6600 - name: ind_mui num_bytes: 371449 num_examples: 1000 - name: ind_rej num_bytes: 368437 num_examples: 1000 - name: ind_sun num_bytes: 2418407 num_examples: 6600 download_size: 10574540 dataset_size: 18307845 - config_name: NusaXBitextMining features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: eng_ace num_bytes: 184722 num_examples: 500 - name: eng_ban num_bytes: 187380 num_examples: 500 - name: eng_bbc num_bytes: 189184 num_examples: 500 - name: eng_bjn num_bytes: 187328 num_examples: 500 - name: eng_bug num_bytes: 191552 num_examples: 500 - name: eng_ind num_bytes: 187480 num_examples: 500 - name: eng_jav num_bytes: 185271 num_examples: 500 - name: eng_mad num_bytes: 187942 num_examples: 500 - name: eng_min num_bytes: 184912 num_examples: 500 - name: eng_nij 
num_bytes: 185800 num_examples: 500 - name: eng_sun num_bytes: 187025 num_examples: 500 download_size: 1164585 dataset_size: 2058596 - config_name: Tatoeba features: - name: sentence1 dtype: string - name: sentence2 dtype: string - name: lang dtype: string - name: source_dataset dtype: string - name: original_split dtype: string - name: config dtype: string splits: - name: sqi_eng num_bytes: 122146 num_examples: 1000 - name: fry_eng num_bytes: 20712 num_examples: 173 - name: kur_eng num_bytes: 45667 num_examples: 410 - name: tur_eng num_bytes: 120885 num_examples: 1000 - name: deu_eng num_bytes: 154567 num_examples: 1000 - name: nld_eng num_bytes: 122263 num_examples: 1000 - name: ron_eng num_bytes: 121484 num_examples: 1000 - name: ang_eng num_bytes: 21746 num_examples: 134 - name: ido_eng num_bytes: 114933 num_examples: 1000 - name: jav_eng num_bytes: 23958 num_examples: 205 - name: isl_eng num_bytes: 128821 num_examples: 1000 - name: slv_eng num_bytes: 93523 num_examples: 823 - name: cym_eng num_bytes: 60754 num_examples: 575 - name: kaz_eng num_bytes: 75353 num_examples: 575 - name: est_eng num_bytes: 108728 num_examples: 1000 - name: heb_eng num_bytes: 132500 num_examples: 1000 - name: gla_eng num_bytes: 96048 num_examples: 829 - name: mar_eng num_bytes: 149542 num_examples: 1000 - name: lat_eng num_bytes: 118854 num_examples: 1000 - name: bel_eng num_bytes: 146932 num_examples: 1000 - name: pms_eng num_bytes: 64255 num_examples: 525 - name: gle_eng num_bytes: 107423 num_examples: 1000 - name: pes_eng num_bytes: 142719 num_examples: 1000 - name: nob_eng num_bytes: 117481 num_examples: 1000 - name: bul_eng num_bytes: 151279 num_examples: 1000 - name: cbk_eng num_bytes: 111726 num_examples: 1000 - name: hun_eng num_bytes: 117889 num_examples: 1000 - name: uig_eng num_bytes: 139052 num_examples: 1000 - name: rus_eng num_bytes: 141472 num_examples: 1000 - name: spa_eng num_bytes: 121266 num_examples: 1000 - name: hye_eng num_bytes: 105044 num_examples: 742 - name: 
tel_eng num_bytes: 36609 num_examples: 234 - name: afr_eng num_bytes: 108635 num_examples: 1000 - name: mon_eng num_bytes: 66088 num_examples: 440 - name: arz_eng num_bytes: 54851 num_examples: 477 - name: hrv_eng num_bytes: 113934 num_examples: 1000 - name: nov_eng num_bytes: 26309 num_examples: 257 - name: gsw_eng num_bytes: 10438 num_examples: 117 - name: nds_eng num_bytes: 112458 num_examples: 1000 - name: ukr_eng num_bytes: 130931 num_examples: 1000 - name: uzb_eng num_bytes: 45744 num_examples: 428 - name: lit_eng num_bytes: 114243 num_examples: 1000 - name: ina_eng num_bytes: 136502 num_examples: 1000 - name: lfn_eng num_bytes: 120490 num_examples: 1000 - name: zsm_eng num_bytes: 128209 num_examples: 1000 - name: ita_eng num_bytes: 114833 num_examples: 1000 - name: cmn_eng num_bytes: 117931 num_examples: 1000 - name: lvs_eng num_bytes: 112206 num_examples: 1000 - name: glg_eng num_bytes: 125576 num_examples: 1000 - name: ceb_eng num_bytes: 65538 num_examples: 600 - name: bre_eng num_bytes: 100080 num_examples: 1000 - name: ben_eng num_bytes: 140703 num_examples: 1000 - name: swg_eng num_bytes: 12209 num_examples: 112 - name: arq_eng num_bytes: 122033 num_examples: 911 - name: kab_eng num_bytes: 111726 num_examples: 1000 - name: fra_eng num_bytes: 129018 num_examples: 1000 - name: por_eng num_bytes: 124185 num_examples: 1000 - name: tat_eng num_bytes: 141546 num_examples: 1000 - name: oci_eng num_bytes: 110752 num_examples: 1000 - name: pol_eng num_bytes: 120148 num_examples: 1000 - name: war_eng num_bytes: 120826 num_examples: 1000 - name: aze_eng num_bytes: 108400 num_examples: 1000 - name: vie_eng num_bytes: 140407 num_examples: 1000 - name: nno_eng num_bytes: 117572 num_examples: 1000 - name: cha_eng num_bytes: 17929 num_examples: 137 - name: mhr_eng num_bytes: 134104 num_examples: 1000 - name: dan_eng num_bytes: 119826 num_examples: 1000 - name: ell_eng num_bytes: 127961 num_examples: 1000 - name: amh_eng num_bytes: 18104 num_examples: 168 - name: 
pam_eng num_bytes: 112464 num_examples: 1000 - name: hsb_eng num_bytes: 52308 num_examples: 483 - name: srp_eng num_bytes: 122115 num_examples: 1000 - name: epo_eng num_bytes: 133841 num_examples: 1000 - name: kzj_eng num_bytes: 122309 num_examples: 1000 - name: awa_eng num_bytes: 27596 num_examples: 231 - name: fao_eng num_bytes: 29226 num_examples: 262 - name: mal_eng num_bytes: 128380 num_examples: 687 - name: ile_eng num_bytes: 99680 num_examples: 1000 - name: bos_eng num_bytes: 35066 num_examples: 354 - name: cor_eng num_bytes: 93849 num_examples: 1000 - name: cat_eng num_bytes: 122815 num_examples: 1000 - name: eus_eng num_bytes: 115068 num_examples: 1000 - name: yue_eng num_bytes: 124086 num_examples: 1000 - name: swe_eng num_bytes: 111248 num_examples: 1000 - name: dtp_eng num_bytes: 114064 num_examples: 1000 - name: kat_eng num_bytes: 108714 num_examples: 746 - name: jpn_eng num_bytes: 142083 num_examples: 1000 - name: csb_eng num_bytes: 29584 num_examples: 253 - name: xho_eng num_bytes: 14385 num_examples: 142 - name: orv_eng num_bytes: 101942 num_examples: 835 - name: ind_eng num_bytes: 123844 num_examples: 1000 - name: tuk_eng num_bytes: 19169 num_examples: 203 - name: max_eng num_bytes: 31590 num_examples: 284 - name: swh_eng num_bytes: 39577 num_examples: 390 - name: hin_eng num_bytes: 171558 num_examples: 1000 - name: dsb_eng num_bytes: 52150 num_examples: 479 - name: ber_eng num_bytes: 112627 num_examples: 1000 - name: tam_eng num_bytes: 54484 num_examples: 307 - name: slk_eng num_bytes: 109760 num_examples: 1000 - name: tgl_eng num_bytes: 117138 num_examples: 1000 - name: ast_eng num_bytes: 15506 num_examples: 127 - name: mkd_eng num_bytes: 132826 num_examples: 1000 - name: khm_eng num_bytes: 109362 num_examples: 722 - name: ces_eng num_bytes: 113396 num_examples: 1000 - name: tzl_eng num_bytes: 9339 num_examples: 104 - name: urd_eng num_bytes: 137712 num_examples: 1000 - name: ara_eng num_bytes: 121650 num_examples: 1000 - name: kor_eng num_bytes: 
128139 num_examples: 1000 - name: yid_eng num_bytes: 122356 num_examples: 848 - name: fin_eng num_bytes: 124669 num_examples: 1000 - name: tha_eng num_bytes: 90050 num_examples: 548 - name: wuu_eng num_bytes: 123826 num_examples: 1000 download_size: 5020276 dataset_size: 11059627 ---

# MTEB BitextMining Aggregated Dataset (Full)

This dataset aggregates **ALL configs** from 10 BitextMining datasets in the MTEB (Massive Text Embedding Benchmark) Multilingual v2 benchmark into a single, unified dataset for comprehensive bitext mining evaluation.

## Dataset Summary

- **Total Examples**: 448,229 sentence pairs
- **Source Datasets (Configs)**: 10 MTEB BitextMining tasks
- **Total Splits**: 332 language pairs/configurations
- **Languages**: 300+ unique language codes across all datasets
- **Task**: Bitext Mining (parallel sentence retrieval)
- **Format**: Standardized schema across all sources

## Structure

Each **source dataset** is a **config**, and each **original config** (language pair) within that dataset is a **split**.
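The split names listed in the dataset metadata use underscores (for example `fra_eng`, `de_en`, `asm_Beng_ben_Beng`), while language-pair codes are usually written with a hyphen (`fra-eng`). The sketch below shows one way to bridge the two. It assumes that replacing hyphens with underscores is the only transformation needed, which is inferred from the metadata rather than confirmed by the dataset author. `to_split_name` and `load_pair` are hypothetical helper names, not part of any library.

```python
def to_split_name(lang_pair: str) -> str:
    """Turn a hyphenated language-pair code into an underscore split name.

    Assumption: the aggregated dataset only swaps '-' for '_'.
    """
    return lang_pair.replace("-", "_")


def load_pair(config: str, lang_pair: str):
    """Load one language pair from one config (hypothetical helper).

    The import is local so that to_split_name stays usable without
    the `datasets` package installed. Calling this needs network access.
    """
    from datasets import load_dataset

    return load_dataset(
        "SaylorTwift/mteb-bitext-mining-aggregated",
        config,
        split=to_split_name(lang_pair),
    )


print(to_split_name("fra-eng"))            # fra_eng
print(to_split_name("asm_Beng-ben_Beng"))  # asm_Beng_ben_Beng
```

Passing `split=` returns a single `Dataset` rather than a `DatasetDict`, which avoids materializing every split of a config when only one language pair is needed.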
### Example Usage

```python
from datasets import load_dataset

# Load a specific config (source dataset)
tatoeba = load_dataset("SaylorTwift/mteb-bitext-mining-aggregated", "Tatoeba")
# This gives you 112 splits, one for each language pair

# Access a specific language pair split.
# Split names use underscores, as listed in the dataset metadata.
french_english = tatoeba['fra_eng']
print(f"French-English pairs: {len(french_english)}")

# Load another config
indic = load_dataset("SaylorTwift/mteb-bitext-mining-aggregated", "IndicGenBenchFloresBitextMining")
# This gives you 58 splits for different Indic language pairs

# Access a split
hindi_english = indic['hin_eng']
```

## Schema

Each example contains:

- `sentence1` (string): First sentence of the pair
- `sentence2` (string): Second sentence of the pair (translation/parallel text)
- `lang` (string): Language pair code (e.g., "fra-eng", "de-en")
- `source_dataset` (string): Original MTEB dataset name
- `original_split` (string): Original split name (train/validation/test)
- `config` (string): Original config name

## Configs (Source Datasets)

| Config | Splits | Examples | Description |
|--------|--------|----------|-------------|
| **Tatoeba** | 112 | 88,877 | Tatoeba sentence pairs across 112 language pairs |
| **IN22GenBitextMining** | 128 | 131,072 | Indic language pairs (23 languages, all combinations) |
| **IndicGenBenchFloresBitextMining** | 58 | 116,522 | Indic languages with English from Flores |
| **NusaTranslationBitextMining** | 11 | 50,200 | Indonesian regional language pairs |
| **BUCC_v2** | 4 | 35,000 | BUCC bitext mining (de-en, fr-en, ru-en, zh-en) |
| **DiaBlaBitextMining** | 2 | 11,496 | English-French dialogue pairs (both directions) |
| **BornholmBitextMining** | 1 | 6,785 | Danish dialect pairs |
| **NusaXBitextMining** | 11 | 5,500 | Indonesian languages with English |
| **NollySentiBitextMining** | 4 | 1,640 | Nigerian languages with English |
| **NorwegianCourtsBitextMining** | 1 | 1,137 | Norwegian court document pairs |

**Total**: 10 configs, 332 splits,
448,229 examples

## Example Splits by Config

### Tatoeba (112 language pairs)

`sqi_eng`, `fry_eng`, `kur_eng`, `tur_eng`, `deu_eng`, `ell_eng`, `spa_eng`, `fra_eng`, `ita_eng`, `jpn_eng`, `cmn_eng`, `kor_eng`, `ara_eng`, `rus_eng`, `por_eng`, `hin_eng`, etc.

### IN22GenBitextMining (128 Indic pairs)

`asm_Beng_ben_Beng`, `asm_Beng_eng_Latn`, `ben_Beng_hin_Deva`, `guj_Gujr_mar_Deva`, etc. (all combinations of 23 Indic languages)

### IndicGenBenchFloresBitextMining (58 pairs)

`asm_eng`, `awa_eng`, `ben_eng`, `bgc_eng`, `bho_eng`, `bod_eng`, `guj_eng`, `hin_eng`, `kan_eng`, `mal_eng`, `mar_eng`, `nep_eng`, `ory_eng`, `pan_eng`, `tam_eng`, `tel_eng`, `urd_eng`, etc.

### BUCC_v2 (4 language pairs)

`de_en`, `fr_en`, `ru_en`, `zh_en`

### NusaTranslationBitextMining (11 Indonesian languages)

`ind_abs`, `ind_bew`, `ind_bhp`, `ind_btk`, `ind_jav`, `ind_mad`, `ind_mak`, `ind_min`, `ind_mui`, `ind_rej`, `ind_sun`

### NusaXBitextMining (11 pairs)

`eng_ace`, `eng_ban`, `eng_bbc`, `eng_bjn`, `eng_bug`, `eng_ind`, `eng_jav`, `eng_mad`, `eng_min`, `eng_nij`, `eng_sun`

## Usage Examples

### Load all language pairs from a specific source

```python
from datasets import load_dataset

# Load all Tatoeba language pairs
tatoeba = load_dataset("SaylorTwift/mteb-bitext-mining-aggregated", "Tatoeba")

# Iterate through all language pairs
for lang_pair, dataset in tatoeba.items():
    print(f"{lang_pair}: {len(dataset)} pairs")
```

### Load a specific language pair

```python
from datasets import load_dataset

# Load just German-English from BUCC (split names use underscores)
bucc = load_dataset("SaylorTwift/mteb-bitext-mining-aggregated", "BUCC_v2")
de_en = bucc['de_en']

for example in de_en:
    print(f"DE: {example['sentence1']}")
    print(f"EN: {example['sentence2']}")
    print()
```

### Filter by language across all datasets

```python
from datasets import load_dataset

# Load Tatoeba
tatoeba = load_dataset("SaylorTwift/mteb-bitext-mining-aggregated", "Tatoeba")

# Get all examples for a specific language pair
french_english = tatoeba['fra_eng']
print(f"Found {len(french_english)} French-English pairs")
```

## Excluded Datasets

**BibleNLPBitextMining** (828 configs, 900+ languages) was excluded because its schema is incompatible: it uses language codes as column names instead of the standard `sentence1`/`sentence2` format.

**FloresBitextMining** and **NTREXBitextMining** were excluded in the previous version but may be revisited with updated processing.

## Citation

If you use this dataset, please cite the MTEB benchmark:

```bibtex
@article{muennighoff2022mteb,
  title={MTEB: Massive Text Embedding Benchmark},
  author={Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  journal={arXiv preprint arXiv:2210.07316},
  year={2022}
}
```

## Individual Dataset Citations

### Tatoeba

```bibtex
@article{artetxe2019massively,
  title={Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond},
  author={Artetxe, Mikel and Schwenk, Holger},
  journal={Transactions of the Association for Computational Linguistics},
  year={2019}
}
```

### BUCC

```bibtex
@inproceedings{zweigenbaum2017overview,
  title={Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora},
  author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},
  booktitle={Proceedings of the 10th workshop on building and using comparable corpora},
  year={2017}
}
```

*Additional citations available in the original MTEB task metadata and individual dataset pages.*

## Dataset Statistics

### Language Coverage

- **Total unique language codes**: 300+
- **Language families**: Indo-European, Sino-Tibetan, Afro-Asiatic, Austronesian, Dravidian, and many more
- **Coverage**: High-resource (English, French, German, Spanish, Chinese, etc.), mid-resource (Hindi, Bengali, Tamil, etc.), and low-resource languages

### Split Distribution

- **Total splits**: 332 (each representing a specific language pair or configuration)
- **Examples per split**: Varies widely, from as few as 104 (Tatoeba `tzl_eng`, per the split metadata) to well over 6,000 (for example, the single 6,785-example BornholmBitextMining split); many splits contain roughly 500 to 2,000 examples

### Data Quality

- All sentence
pairs have been validated to contain non-empty `sentence1` and `sentence2` fields
- Language codes are preserved from original datasets
- Source attribution maintained for every example

## License

This aggregated dataset inherits the licenses from its source datasets. Most MTEB datasets are released under permissive licenses (Apache 2.0, MIT, CC-BY, etc.). Please refer to the original dataset pages for specific licensing information.

## Acknowledgments

- **MTEB Team**: For creating and maintaining the benchmark
- **Original Dataset Creators**: For providing high-quality bitext mining datasets
- **Hugging Face**: For dataset hosting and infrastructure

## Version History

- **v2.0 (2026-04-02)**: Full release
  - 10 source datasets (configs)
  - 332 splits (all language pairs)
  - 448,229 sentence pairs
  - 300+ language codes
- **v1.0 (2026-04-02)**: Initial partial release (deprecated)
  - Only loaded default configs
  - 8 source datasets
  - 139,457 examples

## Contact

For questions or issues with this aggregated dataset, please open an issue on the repository or contact the dataset creator.
Provider: SaylorTwift