mteb/tatoeba-bitext-mining
Dataset Overview
Language Support
The dataset supports the following languages (ISO 639-3 codes):
- eng
- sqi
- fry
- kur
- tur
- deu
- nld
- ron
- ang
- ido
- jav
- isl
- slv
- cym
- kaz
- est
- heb
- gla
- mar
- lat
- bel
- pms
- gle
- pes
- nob
- bul
- cbk
- hun
- uig
- rus
- spa
- hye
- tel
- afr
- mon
- arz
- hrv
- nov
- gsw
- nds
- ukr
- uzb
- lit
- ina
- lfn
- zsm
- ita
- cmn
- lvs
- glg
- ceb
- bre
- ben
- swg
- arq
- kab
- fra
- por
- tat
- oci
- pol
- war
- aze
- vie
- nno
- cha
- mhr
- dan
- ell
- amh
- pam
- hsb
- srp
- epo
- kzj
- awa
- fao
- mal
- ile
- bos
- cor
- cat
- eus
- yue
- swe
- dtp
- kat
- jpn
- csb
- xho
- orv
- ind
- tuk
- max
- swh
- hin
- dsb
- ber
- tam
- slk
- tgl
- ast
- mkd
- khm
- ces
- tzl
- urd
- ara
- kor
- yid
- fin
- tha
- wuu
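Configurations
The dataset contains multiple configurations, one per language pair, and every configuration provides a test split only. Some example configurations are listed below: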
- config_name: default
  data_files:
  - split: test
    path: "test/*"
- config_name: sqi-eng
  data_files:
  - split: test
    path: "test/sqi-eng.jsonl.gz"
- config_name: fry-eng
  data_files:
  - split: test
    path: "test/fry-eng.jsonl.gz"
- config_name: kur-eng
  data_files:
  - split: test
    path: "test/kur-eng.jsonl.gz"
- config_name: tur-eng
  data_files:
  - split: test
    path: "test/tur-eng.jsonl.gz"
- config_name: deu-eng
  data_files:
  - split: test
    path: "test/deu-eng.jsonl.gz"
- config_name: nld-eng
  data_files:
  - split: test
    path: "test/nld-eng.jsonl.gz"
- config_name: ron-eng
  data_files:
  - split: test
    path: "test/ron-eng.jsonl.gz"
- config_name: ang-eng
  data_files:
  - split: test
    path: "test/ang-eng.jsonl.gz"
- config_name: ido-eng
  data_files:
  - split: test
    path: "test/ido-eng.jsonl.gz"
- config_name: jav-eng
  data_files:
  - split: test
    path: "test/jav-eng.jsonl.gz"
- config_name: isl-eng
  data_files:
  - split: test
    path: "test/isl-eng.jsonl.gz"
- config_name: slv-eng
  data_files:
  - split: test
    path: "test/slv-eng.jsonl.gz"
- config_name: cym-eng
  data_files:
  - split: test
    path: "test/cym-eng.jsonl.gz"
- config_name: kaz-eng
  data_files:
  - split: test
    path: "test/kaz-eng.jsonl.gz"
- config_name: est-eng
  data_files:
  - split: test
    path: "test/est-eng.jsonl.gz"
- config_name: heb-eng
  data_files:
  - split: test
    path: "test/heb-eng.jsonl.gz"
- config_name: gla-eng
  data_files:
  - split: test
    path: "test/gla-eng.jsonl.gz"
- config_name: mar-eng
  data_files:
  - split: test
    path: "test/mar-eng.jsonl.gz"
- config_name: lat-eng
  data_files:
  - split: test
    path: "test/lat-eng.jsonl.gz"
- config_name: bel-eng
  data_files:
  - split: test
    path: "test/bel-eng.jsonl.gz"
- config_name: pms-eng
  data_files:
  - split: test
    path: "test/pms-eng.jsonl.gz"
- config_name: gle-eng
  data_files:
  - split: test
    path: "test/gle-eng.jsonl.gz"
- config_name: pes-eng
  data_files:
  - split: test
    path: "test/pes-eng.jsonl.gz"
- config_name: nob-eng
  data_files:
  - split: test
    path: "test/nob-eng.jsonl.gz"
- config_name: bul-eng
  data_files:
  - split: test
    path: "test/bul-eng.jsonl.gz"
- config_name: cbk-eng
  data_files:
  - split: test
    path: "test/cbk-eng.jsonl.gz"
- config_name: hun-eng
  data_files:
  - split: test
    path: "test/hun-eng.jsonl.gz"
- config_name: uig-eng
  data_files:
  - split: test
    path: "test/uig-eng.jsonl.gz"
- config_name: rus-eng
  data_files:
  - split: test
    path: "test/rus-eng.jsonl.gz"
- config_name: spa-eng
  data_files:
  - split: test
    path: "test/spa-eng.jsonl.gz"
- config_name: hye-eng
  data_files:
  - split: test
    path: "test/hye-eng.jsonl.gz"
- config_name: tel-eng
  data_files:
  - split: test
    path: "test/tel-eng.jsonl.gz"
- config_name: afr-eng
  data_files:
  - split: test
    path: "test/afr-eng.jsonl.gz"
- config_name: mon-eng
  data_files:
  - split: test
    path: "test/mon-eng.jsonl.gz"
- config_name: arz-eng
  data_files:
  - split: test
    path: "test/arz-eng.jsonl.gz"
- config_name: hrv-eng
  data_files:
  - split: test
    path: "test/hrv-eng.jsonl.gz"
- config_name: nov-eng
  data_files:
  - split: test
    path: "test/nov-eng.jsonl.gz"
- config_name: gsw-eng
  data_files:
  - split: test
    path: "test/gsw-eng.jsonl.gz"
- config_name: nds-eng
  data_files:
  - split: test
    path: "test/nds-eng.jsonl.gz"
- config_name: ukr-eng
  data_files:
  - split: test
    path: "test/ukr-eng.jsonl.gz"
- config_name: uzb-eng
  data_files:
  - split: test
    path: "test/uzb-eng.jsonl.gz"
- config_name: lit-eng
  data_files:
  - split: test
    path: "test/lit-eng.jsonl.gz"
- config_name: ina-eng
  data_files:
  - split: test
    path: "test/ina-eng.jsonl.gz"
- config_name: lfn-eng
  data_files:
  - split: test
    path: "test/lfn-eng.jsonl.gz"
- config_name: zsm-eng
  data_files:
  - split: test
    path: "test/zsm-eng.jsonl.gz"
- config_name: ita-eng
  data_files:
  - split: test
    path: "test/ita-eng.jsonl.gz"
- config_name: cmn-eng
  data_files:
  - split: test
    path: "test/cmn-eng.jsonl.gz"
- config_name: lvs-eng
  data_files:
  - split: test
    path: "test/lvs-eng.jsonl.gz"
- config_name: glg-eng
  data_files:
  - split: test
    path: "test/glg-eng.jsonl.gz"
- config_name: ceb-eng
  data_files:
  - split: test
    path: "test/ceb-eng.jsonl.gz"
- config_name: bre-eng
  data_files:
  - split: test
    path: "test/bre-eng.jsonl.gz"
- config_name: ben-eng
  data_files:
  - split: test
    path: "test/ben-eng.jsonl.gz"
- config_name: swg-eng
  data_files:
  - split: test
    path: "test/swg-eng.jsonl.gz"
- config_name: arq-eng
  data_files:
  - split: test
    path: "test/arq-eng.jsonl.gz"
- config_name: kab-eng
  data_files:
  - split: test
    path: "test/kab-eng.jsonl.gz"
- config_name: fra-eng
  data_files:
  - split: test
    path: "test/fra-eng.jsonl.gz"
- config_name: por-eng
  data_files:
  - split: test
    path: "test/por-eng.jsonl.gz"
- config_name: tat-eng
  data_files:
  - split: test
    path: "test/tat-eng.jsonl.gz"
- config_name: oci-eng
  data_files:
  - split: test
    path: "test/oci-eng.jsonl.gz"
- config_name: pol-eng
  data_files:
  - split: test
    path: "test/pol-eng.jsonl.gz"
- config_name: war-eng
  data_files:
  - split: test
    path: "test/war-eng.jsonl.gz"
- config_name: aze-eng
  data_files:
  - split: test
    path: "test/aze-eng.jsonl.gz"
- config_name: vie-eng
  data_files:
  - split: test
    path: "test/vie-eng.jsonl.gz"
- config_name: nno-eng
  data_files:
  - split: test
    path: "test/nno-eng.jsonl.gz"
- config_name: cha-eng
  data_files:
  - split: test
    path: "test/cha-eng.jsonl.gz"
- config_name: mhr-eng
  data_files:
  - split: test
    path: "test/mhr-eng.jsonl.gz"
- config_name: dan-eng
  data_files:
  - split: test
    path: "test/dan-eng.jsonl.gz"
- config_name: ell-eng
  data_files:
  - split: test
    path: "test/ell-eng.jsonl.gz"
- config_name: amh-eng
  data_files:
  - split: test
    path: "test/amh-eng.jsonl.gz"
- config_name: pam-eng
  data_files:
  - split: test
    path: "test/pam-eng.jsonl.gz"
- config_name: hsb-eng
  data_files:
  - split: test
    path: "test/hsb-eng.jsonl.gz"
- config_name: srp-eng
  data_files:
  - split: test
    path: "test/srp-eng.jsonl.gz"
- config_name: epo-eng
  data_files:
  - split: test
    path: "test/epo-eng.jsonl.gz"
- config_name: kzj-eng
  data_files:
  - split: test
    path: "test/kzj-eng.jsonl.gz"
- config_name: awa-eng
  data_files:
  - split: test
    path: "test/awa-eng.jsonl.gz"
- config_name: fao-eng
  data_files:
  - split: test
    path: "test/fao-eng.jsonl.gz"
- config_name: mal-eng
  data_files:
  - split: test
    path: "
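
The naming pattern in the configurations listed above (a source-language code paired with eng, and a matching gzip-compressed JSON Lines file under test/) can be sketched in Python as follows. This is a minimal illustration, not part of the dataset card: the helper names config_name, data_path, read_jsonl_gz and load_pair are hypothetical, and load_pair assumes the Hugging Face datasets library is installed and the Hub is reachable. The record fields inside each file are not described in this listing, so the sketch makes no assumption about them.

```python
import gzip
import json


def config_name(lang: str, target: str = "eng") -> str:
    """Build a config name such as "sqi-eng" from a source-language code."""
    return f"{lang}-{target}"


def data_path(lang: str, target: str = "eng") -> str:
    """Relative path of the test file for a pair, following the listed pattern."""
    return f"test/{config_name(lang, target)}.jsonl.gz"


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file into a list of dicts.

    Blank lines are skipped; field names are whatever the file contains.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_pair(lang: str):
    """Load the test split of one language pair from the Hub.

    Assumption: the `datasets` library is installed and network access is
    available. The import is deferred so the helpers above work without it.
    """
    from datasets import load_dataset

    return load_dataset(
        "mteb/tatoeba-bitext-mining", config_name(lang), split="test"
    )
```

For example, config_name("fry") gives "fry-eng" and data_path("fry") gives "test/fry-eng.jsonl.gz", matching the corresponding entry in the list. Calling load_pair("fry") would then request only that pair rather than the default configuration, which matches every file under test/.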



