
mteb/tatoeba-bitext-mining

Source: Hugging Face. Updated: 2025-05-04. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mteb/tatoeba-bitext-mining
Summary:

Tatoeba is a multilingual dataset of sentence pairs, each aligned with English. It is human-annotated, intended for translation tasks, covers a wide range of languages, and is part of the MTEB (Massive Text Embedding Benchmark) project. The dataset is released under the CC BY 2.0 license. The README also explains how to evaluate models on this dataset with the MTEB library and gives citation information for both the dataset and the MTEB project.
Provider:
mteb
Summary of original information

Dataset overview

Supported languages

The dataset supports the following languages:

  • eng
  • sqi
  • fry
  • kur
  • tur
  • deu
  • nld
  • ron
  • ang
  • ido
  • jav
  • isl
  • slv
  • cym
  • kaz
  • est
  • heb
  • gla
  • mar
  • lat
  • bel
  • pms
  • gle
  • pes
  • nob
  • bul
  • cbk
  • hun
  • uig
  • rus
  • spa
  • hye
  • tel
  • afr
  • mon
  • arz
  • hrv
  • nov
  • gsw
  • nds
  • ukr
  • uzb
  • lit
  • ina
  • lfn
  • zsm
  • ita
  • cmn
  • lvs
  • glg
  • ceb
  • bre
  • ben
  • swg
  • arq
  • kab
  • fra
  • por
  • tat
  • oci
  • pol
  • war
  • aze
  • vie
  • nno
  • cha
  • mhr
  • dan
  • ell
  • amh
  • pam
  • hsb
  • srp
  • epo
  • kzj
  • awa
  • fao
  • mal
  • ile
  • bos
  • cor
  • cat
  • eus
  • yue
  • swe
  • dtp
  • kat
  • jpn
  • csb
  • xho
  • orv
  • ind
  • tuk
  • max
  • swh
  • hin
  • dsb
  • ber
  • tam
  • slk
  • tgl
  • ast
  • mkd
  • khm
  • ces
  • tzl
  • urd
  • ara
  • kor
  • yid
  • fin
  • tha
  • wuu
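
The configurations listed on this page pair each language code with English, using names of the form `sqi-eng` and files of the form `test/sqi-eng.jsonl.gz`. The sketch below derives both from a language code. It is illustrative only: the helper names `config_name` and `split_path` are hypothetical, only the first few codes are included, and it assumes that `eng` has no pair of its own and that every other code follows the same naming pattern.

```python
# First few codes from the language list above (not the full list).
LANGUAGE_CODES = ["eng", "sqi", "fry", "kur", "tur", "deu"]


def config_name(code: str) -> str:
    """Return the configuration name that pairs `code` with English."""
    if code == "eng":
        # Assumption: English is the shared side of every pair, so it has
        # no "eng-eng" configuration.
        raise ValueError("eng is paired with every other code, not with itself")
    return f"{code}-eng"


def split_path(code: str) -> str:
    """Return the test-split file path, following the pattern in the listing."""
    return f"test/{config_name(code)}.jsonl.gz"


# Map each non-English code to the path of its test file.
pairs = {code: split_path(code) for code in LANGUAGE_CODES if code != "eng"}

for code, path in pairs.items():
    print(config_name(code), path)
```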

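Configurations

The dataset contains multiple configurations, one per language pair, and every configuration includes test data. Some example configurations are shown below: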

  • config_name: default

    • data_files:
      • split: test
        • path: "test/*"
  • config_name: sqi-eng

    • data_files:
      • split: test
        • path: "test/sqi-eng.jsonl.gz"
  • config_name: fry-eng

    • data_files:
      • split: test
        • path: "test/fry-eng.jsonl.gz"
  • config_name: kur-eng

    • data_files:
      • split: test
        • path: "test/kur-eng.jsonl.gz"
  • config_name: tur-eng

    • data_files:
      • split: test
        • path: "test/tur-eng.jsonl.gz"
  • config_name: deu-eng

    • data_files:
      • split: test
        • path: "test/deu-eng.jsonl.gz"
  • config_name: nld-eng

    • data_files:
      • split: test
        • path: "test/nld-eng.jsonl.gz"
  • config_name: ron-eng

    • data_files:
      • split: test
        • path: "test/ron-eng.jsonl.gz"
  • config_name: ang-eng

    • data_files:
      • split: test
        • path: "test/ang-eng.jsonl.gz"
  • config_name: ido-eng

    • data_files:
      • split: test
        • path: "test/ido-eng.jsonl.gz"
  • config_name: jav-eng

    • data_files:
      • split: test
        • path: "test/jav-eng.jsonl.gz"
  • config_name: isl-eng

    • data_files:
      • split: test
        • path: "test/isl-eng.jsonl.gz"
  • config_name: slv-eng

    • data_files:
      • split: test
        • path: "test/slv-eng.jsonl.gz"
  • config_name: cym-eng

    • data_files:
      • split: test
        • path: "test/cym-eng.jsonl.gz"
  • config_name: kaz-eng

    • data_files:
      • split: test
        • path: "test/kaz-eng.jsonl.gz"
  • config_name: est-eng

    • data_files:
      • split: test
        • path: "test/est-eng.jsonl.gz"
  • config_name: heb-eng

    • data_files:
      • split: test
        • path: "test/heb-eng.jsonl.gz"
  • config_name: gla-eng

    • data_files:
      • split: test
        • path: "test/gla-eng.jsonl.gz"
  • config_name: mar-eng

    • data_files:
      • split: test
        • path: "test/mar-eng.jsonl.gz"
  • config_name: lat-eng

    • data_files:
      • split: test
        • path: "test/lat-eng.jsonl.gz"
  • config_name: bel-eng

    • data_files:
      • split: test
        • path: "test/bel-eng.jsonl.gz"
  • config_name: pms-eng

    • data_files:
      • split: test
        • path: "test/pms-eng.jsonl.gz"
  • config_name: gle-eng

    • data_files:
      • split: test
        • path: "test/gle-eng.jsonl.gz"
  • config_name: pes-eng

    • data_files:
      • split: test
        • path: "test/pes-eng.jsonl.gz"
  • config_name: nob-eng

    • data_files:
      • split: test
        • path: "test/nob-eng.jsonl.gz"
  • config_name: bul-eng

    • data_files:
      • split: test
        • path: "test/bul-eng.jsonl.gz"
  • config_name: cbk-eng

    • data_files:
      • split: test
        • path: "test/cbk-eng.jsonl.gz"
  • config_name: hun-eng

    • data_files:
      • split: test
        • path: "test/hun-eng.jsonl.gz"
  • config_name: uig-eng

    • data_files:
      • split: test
        • path: "test/uig-eng.jsonl.gz"
  • config_name: rus-eng

    • data_files:
      • split: test
        • path: "test/rus-eng.jsonl.gz"
  • config_name: spa-eng

    • data_files:
      • split: test
        • path: "test/spa-eng.jsonl.gz"
  • config_name: hye-eng

    • data_files:
      • split: test
        • path: "test/hye-eng.jsonl.gz"
  • config_name: tel-eng

    • data_files:
      • split: test
        • path: "test/tel-eng.jsonl.gz"
  • config_name: afr-eng

    • data_files:
      • split: test
        • path: "test/afr-eng.jsonl.gz"
  • config_name: mon-eng

    • data_files:
      • split: test
        • path: "test/mon-eng.jsonl.gz"
  • config_name: arz-eng

    • data_files:
      • split: test
        • path: "test/arz-eng.jsonl.gz"
  • config_name: hrv-eng

    • data_files:
      • split: test
        • path: "test/hrv-eng.jsonl.gz"
  • config_name: nov-eng

    • data_files:
      • split: test
        • path: "test/nov-eng.jsonl.gz"
  • config_name: gsw-eng

    • data_files:
      • split: test
        • path: "test/gsw-eng.jsonl.gz"
  • config_name: nds-eng

    • data_files:
      • split: test
        • path: "test/nds-eng.jsonl.gz"
  • config_name: ukr-eng

    • data_files:
      • split: test
        • path: "test/ukr-eng.jsonl.gz"
  • config_name: uzb-eng

    • data_files:
      • split: test
        • path: "test/uzb-eng.jsonl.gz"
  • config_name: lit-eng

    • data_files:
      • split: test
        • path: "test/lit-eng.jsonl.gz"
  • config_name: ina-eng

    • data_files:
      • split: test
        • path: "test/ina-eng.jsonl.gz"
  • config_name: lfn-eng

    • data_files:
      • split: test
        • path: "test/lfn-eng.jsonl.gz"
  • config_name: zsm-eng

    • data_files:
      • split: test
        • path: "test/zsm-eng.jsonl.gz"
  • config_name: ita-eng

    • data_files:
      • split: test
        • path: "test/ita-eng.jsonl.gz"
  • config_name: cmn-eng

    • data_files:
      • split: test
        • path: "test/cmn-eng.jsonl.gz"
  • config_name: lvs-eng

    • data_files:
      • split: test
        • path: "test/lvs-eng.jsonl.gz"
  • config_name: glg-eng

    • data_files:
      • split: test
        • path: "test/glg-eng.jsonl.gz"
  • config_name: ceb-eng

    • data_files:
      • split: test
        • path: "test/ceb-eng.jsonl.gz"
  • config_name: bre-eng

    • data_files:
      • split: test
        • path: "test/bre-eng.jsonl.gz"
  • config_name: ben-eng

    • data_files:
      • split: test
        • path: "test/ben-eng.jsonl.gz"
  • config_name: swg-eng

    • data_files:
      • split: test
        • path: "test/swg-eng.jsonl.gz"
  • config_name: arq-eng

    • data_files:
      • split: test
        • path: "test/arq-eng.jsonl.gz"
  • config_name: kab-eng

    • data_files:
      • split: test
        • path: "test/kab-eng.jsonl.gz"
  • config_name: fra-eng

    • data_files:
      • split: test
        • path: "test/fra-eng.jsonl.gz"
  • config_name: por-eng

    • data_files:
      • split: test
        • path: "test/por-eng.jsonl.gz"
  • config_name: tat-eng

    • data_files:
      • split: test
        • path: "test/tat-eng.jsonl.gz"
  • config_name: oci-eng

    • data_files:
      • split: test
        • path: "test/oci-eng.jsonl.gz"
  • config_name: pol-eng

    • data_files:
      • split: test
        • path: "test/pol-eng.jsonl.gz"
  • config_name: war-eng

    • data_files:
      • split: test
        • path: "test/war-eng.jsonl.gz"
  • config_name: aze-eng

    • data_files:
      • split: test
        • path: "test/aze-eng.jsonl.gz"
  • config_name: vie-eng

    • data_files:
      • split: test
        • path: "test/vie-eng.jsonl.gz"
  • config_name: nno-eng

    • data_files:
      • split: test
        • path: "test/nno-eng.jsonl.gz"
  • config_name: cha-eng

    • data_files:
      • split: test
        • path: "test/cha-eng.jsonl.gz"
  • config_name: mhr-eng

    • data_files:
      • split: test
        • path: "test/mhr-eng.jsonl.gz"
  • config_name: dan-eng

    • data_files:
      • split: test
        • path: "test/dan-eng.jsonl.gz"
  • config_name: ell-eng

    • data_files:
      • split: test
        • path: "test/ell-eng.jsonl.gz"
  • config_name: amh-eng

    • data_files:
      • split: test
        • path: "test/amh-eng.jsonl.gz"
  • config_name: pam-eng

    • data_files:
      • split: test
        • path: "test/pam-eng.jsonl.gz"
  • config_name: hsb-eng

    • data_files:
      • split: test
        • path: "test/hsb-eng.jsonl.gz"
  • config_name: srp-eng

    • data_files:
      • split: test
        • path: "test/srp-eng.jsonl.gz"
  • config_name: epo-eng

    • data_files:
      • split: test
        • path: "test/epo-eng.jsonl.gz"
  • config_name: kzj-eng

    • data_files:
      • split: test
        • path: "test/kzj-eng.jsonl.gz"
  • config_name: awa-eng

    • data_files:
      • split: test
        • path: "test/awa-eng.jsonl.gz"
  • config_name: fao-eng

    • data_files:
      • split: test
        • path: "test/fao-eng.jsonl.gz"
  • config_name: mal-eng

    • data_files:
      • split: test
        • path: "
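
Each configuration above points at a gzip-compressed JSON Lines file. As a minimal sketch of how such a file can be read with the Python standard library, the example below writes a tiny stand-in file and reads it back. The sample rows are placeholders, not real dataset content, and the field names `sentence1` and `sentence2` are assumptions that should be checked against the actual files. The helper `read_jsonl_gz` is hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Placeholder rows standing in for a downloaded file; field names are assumed.
sample_rows = [
    {"sentence1": "<source sentence A>", "sentence2": "<English sentence A>"},
    {"sentence1": "<source sentence B>", "sentence2": "<English sentence B>"},
]

with tempfile.TemporaryDirectory() as tmp:
    # Mimic the file naming used by the configurations above.
    sample_file = Path(tmp) / "sqi-eng.jsonl.gz"
    with gzip.open(sample_file, mode="wt", encoding="utf-8") as handle:
        for row in sample_rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")

    # Read the rows back while the temporary file still exists.
    rows = list(read_jsonl_gz(sample_file))

print(len(rows), "rows read")
```

With the Hugging Face `datasets` library installed, a call along the lines of `load_dataset("mteb/tatoeba-bitext-mining", "sqi-eng", split="test")` would typically fetch and parse the same file in one step. This requires network access.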