vpermilp/nllb-200-1.3B-rust
收藏Hugging Face2023-03-04 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/vpermilp/nllb-200-1.3B-rust
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- ace
- acm
- acq
- aeb
- af
- ajp
- ak
- als
- am
- apc
- ar
- ars
- ary
- arz
- as
- ast
- awa
- ayr
- azb
- azj
- ba
- bm
- ban
- be
- bem
- bn
- bho
- bjn
- bo
- bs
- bug
- bg
- ca
- ceb
- cs
- cjk
- ckb
- crh
- cy
- da
- de
- dik
- dyu
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fj
- fi
- fon
- fr
- fur
- fuv
- gaz
- gd
- ga
- gl
- gn
- gu
- ht
- ha
- he
- hi
- hne
- hr
- hu
- hy
- ig
- ilo
- id
- is
- it
- jv
- ja
- kab
- kac
- kam
- kn
- ks
- ka
- kk
- kbp
- kea
- khk
- km
- ki
- rw
- ky
- kmb
- kmr
- knc
- kg
- ko
- lo
- lij
- li
- ln
- lt
- lmo
- ltg
- lb
- lua
- lg
- luo
- lus
- lvs
- mag
- mai
- ml
- mar
- min
- mk
- mt
- mni
- mos
- mi
- my
- nl
- nn
- nb
- npi
- nso
- nus
- ny
- oc
- ory
- pag
- pa
- pap
- pbt
- pes
- plt
- pl
- pt
- prs
- quy
- ro
- rn
- ru
- sg
- sa
- sat
- scn
- shn
- si
- sk
- sl
- sm
- sn
- sd
- so
- st
- es
- sc
- sr
- ss
- su
- sv
- swh
- szl
- ta
- taq
- tt
- te
- tg
- tl
- th
- ti
- tpi
- tn
- ts
- tk
- tum
- tr
- tw
- tzm
- ug
- uk
- umb
- ur
- uzn
- vec
- vi
- war
- wo
- xh
- ydd
- yo
- yue
- zh
- zsm
- zu
language_details: "ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn,bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn"
tags:
- nllb
- translation
license: "cc-by-nc-4.0"
datasets:
- flores-200
metrics:
- bleu
- spbleu
- chrf++
inference: false
---
# NLLB-200
This is the model card of NLLB-200's 1.3B variant.
Here are the [metrics](https://tinyurl.com/nllb200dense1bmetrics) for that particular checkpoint.
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features. The exact training algorithm, data and the strategies to handle data imbalances for high and low resource languages that were used to train NLLB-200 is described in the paper.
- Paper or other resource for more information NLLB Team et al, No Language Left Behind: Scaling Human-Centered Machine Translation, Arxiv, 2022
- License: CC-BY-NC
- Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues
## Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, - especially for low-resource languages. It allows for single sentence translation among 200 languages. Information on how to - use the model can be found in Fairseq code repository along with the training code and references to evaluation and training data.
- Primary intended users: Primary users are researchers and machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general domain text data and is not intended to be used with domain specific texts, such as medical domain or legal domain. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, therefore translating longer sequences might result in quality degradation. NLLB-200 translations can not be used as certified translations.
## Metrics
• Model performance measures: NLLB-200 model was evaluated using BLEU, spBLEU, and chrF++ metrics widely adopted by machine translation community. Additionally, we performed human evaluation with the XSTS protocol and measured the toxicity of the generated translations.
## Evaluation Data
- Datasets: Flores-200 dataset is described in Section 4
- Motivation: We used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The
SentencePiece model is released along with NLLB-200.
## Training Data
• We used parallel multilingual data from a variety of sources to train the model. We provide detailed report on data selection and construction process in Section 5 in the paper. We also used monolingual data constructed from Common Crawl. We provide more details in Section 5.2.
## Ethical Considerations
• In this work, we took a reflexive approach in technological development to ensure that we prioritize human users and minimize risks that could be transferred to them. While we reflect on our ethical considerations throughout the article, here are some additional points to highlight. For one, many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many in these communities, such an access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. The latter scenarios could arise if bad actors misappropriate our work for nefarious activities, which we conceive as an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although we invested heavily in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although we did our best to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).
## Caveats and Recommendations
• Our model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model is not capturing. Users should make appropriate assessments.
## Carbon Footprint Details
• The carbon dioxide (CO2e) estimate is reported in Section 8.8.
提供机构:
vpermilp
原始信息汇总
数据集概述
数据集名称
- NLLB-200
支持的语言
- 200种语言,包括但不限于:ace, acm, acq, aeb, af, ajp, aka, amh, apc, arb, ars, ary, arz, asm, ast, awa, ayr, azb, azj, bak, bam, ban, bel, bem, ben, bho, bjn, bod, bos, bug, bul, cat, ceb, ces, cjk, ckb, crh, cym, dan, deu, dik, dyu, dzo, ell, eng, epo, est, eus, ewe, fao, pes, fij, fin, fon, fra, fur, fuv, gla, gle, glg, grn, guj, hat, hau, heb, hin, hne, hrv, hun, hye, ibo, ilo, ind, isl, ita, jav, jpn, kab, kac, kam, kan, kas, kat, kaz, kbp, kea, khm, kik, kin, kir, kmb, kon, kor, kmr, lao, lvs, lij, lim, lin, lit, lmo, ltg, ltz, lua, lug, luo, lus, mag, mai, mal, mar, min, mkd, mlt, mni, mos, mri, mya, nld, nno, nob, npi, nso, nus, nya, oci, gaz, ory, pag, pan, pap, pol, por, prs, pbt, quy, ron, run, rus, sag, san, sat, scn, shn, sin, slk, slv, smo, sna, snd, som, sot, spa, als, srd, srp, ssw, sun, swe, swh, szl, tam, tat, tel, tgk, tgl, tha, tir, taq, tpi, tsn, tso, tuk, tum, tur, twi, tzm, uig, ukr, umb, urd, uzn, vec, vie, war, wol, xho, ydd, yor, yue, zho, zul.
语言详情
- 每种语言支持的书写系统(如ace_Arab, ace_Latn等)。
标签
- nllb
- translation
许可证
- CC-BY-NC-4.0
包含的数据集
- Flores-200
评估指标
- BLEU
- spBLEU
- chrf++
模型用途
- 主要用途:用于机器翻译研究,特别是低资源语言的翻译。
- 主要用户:研究人员和机器翻译研究社区。
- 非预期用途:不用于生产部署,不用于特定领域的文本翻译,如医疗或法律领域,不用于文档翻译,不用于超过512个token的输入长度翻译,不作为认证翻译使用。
评估数据
- 数据集:Flores-200
- 预处理:使用SentencePiece进行句子分割的原始文本数据预处理。
训练数据
- 使用多种来源的并行多语言数据进行训练。
伦理考虑
- 考虑了低资源语言社区可能面临的风险,如信息误传或在线诈骗。
- 训练数据来自公开可用资源,可能包含未完全消除的个人识别信息。
- 尽管努力优化翻译质量,但模型可能产生误译,可能对依赖翻译做重要决策的人产生不利影响。
注意事项
- 模型在Wikimedia领域进行了测试,但未全面评估其他支持的领域。
- 支持的语言可能存在模型未捕捉到的变体。
碳足迹详情
- 二氧化碳排放量估计在第8.8节报告。



