vpermilp/nllb-200-distilled-600M-rust
Source: Hugging Face · Updated 2023-03-04 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/vpermilp/nllb-200-distilled-600M-rust
Resource description:
---
language:
- ace
- acm
- acq
- aeb
- af
- ajp
- ak
- als
- am
- apc
- ar
- ars
- ary
- arz
- as
- ast
- awa
- ayr
- azb
- azj
- ba
- bm
- ban
- be
- bem
- bn
- bho
- bjn
- bo
- bs
- bug
- bg
- ca
- ceb
- cs
- cjk
- ckb
- crh
- cy
- da
- de
- dik
- dyu
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fj
- fi
- fon
- fr
- fur
- fuv
- gaz
- gd
- ga
- gl
- gn
- gu
- ht
- ha
- he
- hi
- hne
- hr
- hu
- hy
- ig
- ilo
- id
- is
- it
- jv
- ja
- kab
- kac
- kam
- kn
- ks
- ka
- kk
- kbp
- kea
- khk
- km
- ki
- rw
- ky
- kmb
- kmr
- knc
- kg
- ko
- lo
- lij
- li
- ln
- lt
- lmo
- ltg
- lb
- lua
- lg
- luo
- lus
- lvs
- mag
- mai
- ml
- mar
- min
- mk
- mt
- mni
- mos
- mi
- my
- nl
- nn
- nb
- npi
- nso
- nus
- ny
- oc
- ory
- pag
- pa
- pap
- pbt
- pes
- plt
- pl
- pt
- prs
- quy
- ro
- rn
- ru
- sg
- sa
- sat
- scn
- shn
- si
- sk
- sl
- sm
- sn
- sd
- so
- st
- es
- sc
- sr
- ss
- su
- sv
- swh
- szl
- ta
- taq
- tt
- te
- tg
- tl
- th
- ti
- tpi
- tn
- ts
- tk
- tum
- tr
- tw
- tzm
- ug
- uk
- umb
- ur
- uzn
- vec
- vi
- war
- wo
- xh
- ydd
- yo
- yue
- zh
- zsm
- zu
language_details: >-
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva,
kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr,
kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn,
lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn,
ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva,
mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn,
mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn,
nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn,
gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn,
prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn,
san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn,
smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn,
srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn,
tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi,
taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn,
tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab,
uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr,
yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
tags:
- nllb
- translation
license: cc-by-nc-4.0
datasets:
- flores-200
metrics:
- bleu
- spbleu
- chrf++
inference: false
task_categories:
- translation
size_categories:
- 100K<n<1M
---
# NLLB-200
This is the model card of NLLB-200's distilled 600M variant.
Here are the [metrics](https://tinyurl.com/nllb200densedst600mmetrics) for that particular checkpoint.
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: the exact training algorithm, the data, and the strategies for handling data imbalances between high- and low-resource languages that were used to train NLLB-200 are described in the paper.
- Paper or other resource for more information: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022
- License: CC-BY-NC
- Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues
## Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository, along with the training code and references to evaluation and training data.
- Primary intended users: Primary users are researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal texts. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.
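The 512-token training limit and the single-sentence scope suggest checking input length before translating. The following is a minimal sketch, not part of the released tooling: it assumes a whitespace split as a stand-in for the released SentencePiece tokenizer (real subword counts will usually be higher), and the helper names `count_tokens` and `split_by_budget` are hypothetical.

```python
import re

MAX_TOKENS = 512  # training-time input limit stated in the model card


def count_tokens(sentence: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words.

    Assumption: the real model uses SentencePiece subwords, so true
    counts are usually higher than this estimate.
    """
    return len(sentence.split())


def split_by_budget(text: str, max_tokens: int = MAX_TOKENS):
    """Split text into sentences and separate those that exceed the budget.

    Returns (ok, too_long): two lists of sentences.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ok = [s for s in sentences if count_tokens(s) <= max_tokens]
    too_long = [s for s in sentences if count_tokens(s) > max_tokens]
    return ok, too_long


# One short sentence and one artificially long sentence (601 words).
ok, too_long = split_by_budget("Short one. " + "word " * 600 + "end.")
print(len(ok), "sentence(s) within budget;", len(too_long), "over budget")
```

Sentences that land in `too_long` would need to be shortened or split further before translation; how to do that well is language-dependent and outside the scope of this sketch.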
## Metrics
• Model performance measures: The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics, which are widely adopted by the machine translation community. Additionally, we performed human evaluation with the XSTS protocol and measured the toxicity of the generated translations.
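To give a feel for what a character-level metric such as chrF++ measures, here is a simplified sentence-level chrF sketch. It is an illustration under stated assumptions, not the evaluation code used for NLLB-200: it averages character n-gram precision and recall for n = 1..6, combines them with an F-score using β = 2, ignores whitespace, and omits the word n-grams that chrF++ adds. Reported scores should come from an established implementation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in [0, 1] (no word n-grams)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)


print(round(chrf("the cat sat on the mat", "the cat sits on the mat"), 3))
```

Because the score is built from character n-grams, a hypothesis that differs from the reference only in a word ending still receives partial credit, which is one reason such metrics are often preferred for morphologically rich languages.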
## Evaluation Data
- Datasets: The Flores-200 dataset is described in Section 4 of the paper.
- Motivation: We used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.
## Training Data
• We used parallel multilingual data from a variety of sources to train the model. We provide a detailed report on the data selection and construction process in Section 5 of the paper. We also used monolingual data constructed from Common Crawl. We provide more details in Section 5.2.
## Ethical Considerations
• In this work, we took a reflexive approach in technological development to ensure that we prioritize human users and minimize risks that could be transferred to them. While we reflect on our ethical considerations throughout the article, here are some additional points to highlight. For one, many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many of these communities, such access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. The latter scenarios could arise if bad actors misappropriate our work for nefarious activities, which we conceive as an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although we invested heavily in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although we did our best to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have an adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).
## Caveats and Recommendations
• Our model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model is not capturing. Users should make appropriate assessments.
## Carbon Footprint Details
• The carbon dioxide (CO2e) estimate is reported in Section 8.8.
Provider:
vpermilp
Summary of original information
NLLB-200 dataset overview
Language support
- Language list:
- ace, acm, acq, aeb, af, ajp, ak, als, am, apc, ar, ars, ary, arz, as, ast, awa, ayr, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fj, fi, fon, fr, fur, fuv, gaz, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kk, kbp, kea, khk, km, ki, rw, ky, kmb, kmr, knc, kg, ko, lo, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, lvs, mag, mai, ml, mar, min, mk, mt, mni, mos, mi, my, nl, nn, nb, npi, nso, nus, ny, oc, ory, pag, pa, pap, pbt, pes, plt, pl, pt, prs, quy, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, sc, sr, ss, su, sv, swh, szl, ta, taq, tt, te, tg, tl, th, ti, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uzn, vec, vi, war, wo, xh, ydd, yo, yue, zh, zsm, zu
- Language details:
- ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, 
zho_Hans, zho_Hant, zul_Latn
Tags
- nllb
- translation
License
- cc-by-nc-4.0
Datasets
- flores-200
Metrics
- bleu
- spbleu
- chrf++
Task categories
- translation
Dataset size
- 100K<n<1M
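Intended use
- Primary use: NLLB-200 is a model intended mainly for machine translation research, especially for low-resource languages. It supports single-sentence translation among 200 languages.
- Primary users: Researchers and the machine translation research community.
- Out-of-scope uses: NLLB-200 is a research model and is not released for production deployment. It is not intended for domain-specific text, such as medical or legal text. The model is not intended for document translation, and translating inputs longer than 512 tokens may degrade quality. NLLB-200 translations cannot be used as certified translations.
Evaluation data
- Dataset: Flores-200
- Motivation: Flores-200 was used because it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Raw text data was preprocessed with SentencePiece.
Training data
- Data sources: Parallel multilingual data from a variety of sources, together with monolingual data constructed from Common Crawl.
Ethical considerations
- Low-resource languages: Many of the selected languages are low-resource languages, African languages in particular. Quality translation could improve education and information access in these communities, but it could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams.
- Data acquisition: The training data were mined from various publicly available sources on the web. Despite extensive data cleaning, personally identifiable information may not be entirely eliminated.
- Translation quality: Despite best efforts to optimize translation quality, mistranslations produced by the model could remain, which could adversely affect people who rely on these translations to make important decisions.
Caveats and recommendations
- Model testing: The model has been tested on the Wikimedia domain, with limited investigation of the other domains supported in NLLB-MD. Users should make appropriate assessments.
Carbon footprint details
- Carbon emission estimate: Reported in the relevant section of the paper.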
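Collected summary
Dataset introduction

Construction
In machine translation, building a multilingual model typically depends on carefully integrating large-scale parallel corpora. The NLLB-200-distilled-600M model builds on NLLB-200 and uses knowledge distillation to derive a lighter 600M-parameter version from the larger original model. Its training data combine parallel multilingual data from a variety of sources with monolingual data constructed from Common Crawl; Flores-200 serves as the evaluation set rather than as training data. Sentence-split raw text was preprocessed with SentencePiece. Construction paid particular attention to low-resource languages: the strategies used to handle the imbalance between high- and low-resource data are described in the paper and aim at robust translation across the full range of covered languages.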
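Features
The defining feature of this resource is its breadth of language coverage: 200 languages and varieties, with particular emphasis on low-resource languages such as those of Africa. Each language is tagged with its script, such as Latin, Arabic, or Cyrillic, which supports research on translation across scripts. As a distilled version, the model has far fewer parameters than the original while aiming to retain much of its translation quality, which makes it easier to use in research settings. Evaluation covers BLEU, spBLEU, and chrF++, supplemented by human evaluation with the XSTS protocol and a measurement of toxicity in the generated translations.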
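The language tags in the `language_details` field pair a language code with a script code, in the form `<language>_<Script>` (the codes appear to follow ISO 639-3 and ISO 15924, which is an assumption here rather than something the card states). A minimal sketch of parsing them, using a hand-picked subset of the list and a hypothetical helper name `group_by_language`:

```python
from collections import defaultdict

# A few codes copied from the language_details list above.
CODES = "ace_Arab, ace_Latn, eng_Latn, rus_Cyrl, zho_Hans, zho_Hant, kas_Arab, kas_Deva"


def group_by_language(codes: str) -> dict:
    """Map each language code to the list of scripts it appears with."""
    scripts = defaultdict(list)
    for code in codes.split(","):
        lang, script = code.strip().split("_")
        scripts[lang].append(script)
    return dict(scripts)


grouped = group_by_language(CODES)
# Languages that are covered in more than one script.
multi_script = {lang: s for lang, s in grouped.items() if len(s) > 1}
print(multi_script)
```

Grouping this way makes it easy to see which languages are offered in several scripts, which matters when choosing the source or target tag for a translation request.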
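Usage
This resource is mainly intended for machine translation research, especially on low-resource languages. The model can be used through the Fairseq code repository for single-sentence translation; inputs should stay within 512 tokens to avoid quality degradation. It is a research model and is not recommended for production environments or for specialized domains such as law or medicine. Note that the training data were mined from the public web and may retain personal information, and that coverage of some language varieties may be limited, so users should evaluate and validate the model carefully for their own domain.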
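The model card points to Fairseq for usage. As an alternative illustration only, the sketch below uses the Hugging Face `transformers` library and assumes the `facebook/nllb-200-distilled-600M` checkpoint on the Hugging Face Hub; neither is mentioned in the card, and the `-rust` repository described here may package the weights differently. The `translate` function name and its defaults are hypothetical.

```python
def translate(text: str, src_lang: str = "eng_Latn", tgt_lang: str = "fra_Latn",
              model_name: str = "facebook/nllb-200-distilled-600M") -> str:
    """Translate a single sentence; downloads the checkpoint on first call.

    Assumption: `model_name` is a transformers-compatible NLLB checkpoint.
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate to the 512-token training limit mentioned in the model card.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(
        **inputs,
        # The target language is selected by forcing its tag as the first token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


# Example call (requires transformers, torch, and a network connection):
# print(translate("No language left behind.", "eng_Latn", "fra_Latn"))
```

The source and target tags are the `<language>_<Script>` codes from the `language_details` list. Loading the tokenizer and model inside the function keeps the sketch short; code that translates many sentences would load them once and reuse them.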
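Background and challenges
Background overview
In machine translation, barriers to cross-language communication have long constrained the global flow of information. For low-resource languages in particular, the lack of high-quality translation models widens the digital divide. NLLB-200 was released in 2022 by the Meta AI research team with the goal of covering 200 languages; the variant described here is a distilled machine translation model. Its core research question is how to achieve efficient translation for low-resource languages through large-scale multilingual data and distillation. It is evaluated on Flores-200 and released under the CC-BY-NC license, and it has pushed machine translation research toward broader language coverage.
Current challenges
The challenges facing NLLB-200 fall into two groups: those of the problem domain and those of the construction process. In the problem domain, machine translation must cope with data scarcity for low-resource languages, morphological diversity, and differences in cultural context, all of which make translation quality unstable, while also balancing performance between high- and low-resource languages. In construction, collecting and cleaning multilingual parallel corpora from public web sources at scale raises the problems of removing residual personally identifiable information and keeping the data representative. Distillation, in turn, has to preserve translation consistency across 200 languages while improving computational efficiency and limiting carbon emissions, and the resulting model can still produce degraded translations when inputs exceed the length limit or come from domains it was not adapted to.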
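Common scenarios
Typical use cases
In machine translation research, the typical use of the NLLB-200 distilled 600M model is benchmarking and validating algorithms for cross-language text conversion. The model covers 200 languages, with particular attention to low-resource language pairs, and gives researchers a standardized evaluation setting. Using parallel evaluation data such as Flores-200, researchers can measure translation quality systematically, explore the limits of multilingual representation learning, and advance neural machine translation with respect to language diversity.
Academic problems addressed
NLLB-200 helps ease the long-standing shortage of data for low-resource languages in machine translation research. Traditional translation models often rely on English-centric language pairs, whereas NLLB-200 includes languages from Africa, South Asia, and other regions, reducing this data asymmetry. Together with Flores-200, it provides a large-scale multilingual translation benchmark that lets researchers quantify transfer effects between languages and offers an empirical basis for topics such as cross-lingual representation learning and zero-shot translation, moving computational linguistics toward more equitable language technology.
Derived and related work
The release of NLLB-200 prompted a line of research on multilingual translation. Building on its architecture, researchers have proposed fine-tuning methods for specific language pairs, such as parameter-efficient training techniques that adapt to dialectal variation. The model has also encouraged closer study of fairness in translation, leading to cross-cultural evaluation frameworks for toxicity detection and bias mitigation. On the engineering side, related work focuses on model compression and deployment optimization, extending multilingual translation to edge devices.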
The content above was collected and summarized by 遇见数据集.



