
datalama/pretrain-nllb-filtered

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/datalama/pretrain-nllb-filtered
Description:
---
configs:
- config_name: nllb_arb_Arab-eng_Latn
  data_files:
  - split: train
    path: nllb_arb_Arab-eng_Latn/train-*
- config_name: nllb_ben_Beng-eng_Latn
  data_files:
  - split: train
    path: nllb_ben_Beng-eng_Latn/train-*
- config_name: nllb_bul_Cyrl-eng_Latn
  data_files:
  - split: train
    path: nllb_bul_Cyrl-eng_Latn/train-*
- config_name: nllb_cat_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_cat_Latn-eng_Latn/train-*
- config_name: nllb_ces_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_ces_Latn-eng_Latn/train-*
- config_name: nllb_dan_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_dan_Latn-eng_Latn/train-*
- config_name: nllb_deu_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_deu_Latn-eng_Latn/train-*
- config_name: nllb_ell_Grek-eng_Latn
  data_files:
  - split: train
    path: nllb_ell_Grek-eng_Latn/train-*
- config_name: nllb_eng_Latn-est_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-est_Latn/train-*
- config_name: nllb_eng_Latn-fin_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-fin_Latn/train-*
- config_name: nllb_eng_Latn-fra_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-fra_Latn/train-*
- config_name: nllb_eng_Latn-glg_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-glg_Latn/train-*
- config_name: nllb_eng_Latn-heb_Hebr
  data_files:
  - split: train
    path: nllb_eng_Latn-heb_Hebr/train-*
- config_name: nllb_eng_Latn-hin_Deva
  data_files:
  - split: train
    path: nllb_eng_Latn-hin_Deva/train-*
- config_name: nllb_eng_Latn-hrv_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-hrv_Latn/train-*
- config_name: nllb_eng_Latn-hun_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-hun_Latn/train-*
- config_name: nllb_eng_Latn-ind_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-ind_Latn/train-*
- config_name: nllb_eng_Latn-isl_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-isl_Latn/train-*
- config_name: nllb_eng_Latn-ita_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-ita_Latn/train-*
- config_name: nllb_eng_Latn-jpn_Jpan
  data_files:
  - split: train
    path: nllb_eng_Latn-jpn_Jpan/train-*
- config_name: nllb_eng_Latn-kaz_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-kaz_Cyrl/train-*
- config_name: nllb_eng_Latn-khm_Khmr
  data_files:
  - split: train
    path: nllb_eng_Latn-khm_Khmr/train-*
- config_name: nllb_eng_Latn-kor_Hang
  data_files:
  - split: train
    path: nllb_eng_Latn-kor_Hang/train-*
- config_name: nllb_eng_Latn-lit_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-lit_Latn/train-*
- config_name: nllb_eng_Latn-lvs_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-lvs_Latn/train-*
- config_name: nllb_eng_Latn-mal_Mlym
  data_files:
  - split: train
    path: nllb_eng_Latn-mal_Mlym/train-*
- config_name: nllb_eng_Latn-mar_Deva
  data_files:
  - split: train
    path: nllb_eng_Latn-mar_Deva/train-*
- config_name: nllb_eng_Latn-mkd_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-mkd_Cyrl/train-*
- config_name: nllb_eng_Latn-mya_Mymr
  data_files:
  - split: train
    path: nllb_eng_Latn-mya_Mymr/train-*
- config_name: nllb_eng_Latn-nld_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-nld_Latn/train-*
- config_name: nllb_eng_Latn-pes_Arab
  data_files:
  - split: train
    path: nllb_eng_Latn-pes_Arab/train-*
- config_name: nllb_eng_Latn-pol_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-pol_Latn/train-*
- config_name: nllb_eng_Latn-por_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-por_Latn/train-*
- config_name: nllb_eng_Latn-ron_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-ron_Latn/train-*
- config_name: nllb_eng_Latn-rus_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-rus_Cyrl/train-*
- config_name: nllb_eng_Latn-slk_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-slk_Latn/train-*
- config_name: nllb_eng_Latn-slv_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-slv_Latn/train-*
- config_name: nllb_eng_Latn-spa_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-spa_Latn/train-*
- config_name: nllb_eng_Latn-srp_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-srp_Cyrl/train-*
- config_name: nllb_eng_Latn-swe_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-swe_Latn/train-*
- config_name: nllb_eng_Latn-swh_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-swh_Latn/train-*
- config_name: nllb_eng_Latn-tam_Taml
  data_files:
  - split: train
    path: nllb_eng_Latn-tam_Taml/train-*
- config_name: nllb_eng_Latn-tel_Telu
  data_files:
  - split: train
    path: nllb_eng_Latn-tel_Telu/train-*
- config_name: nllb_eng_Latn-tgl_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-tgl_Latn/train-*
- config_name: nllb_eng_Latn-tur_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-tur_Latn/train-*
- config_name: nllb_eng_Latn-ukr_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-ukr_Cyrl/train-*
- config_name: nllb_eng_Latn-urd_Arab
  data_files:
  - split: train
    path: nllb_eng_Latn-urd_Arab/train-*
- config_name: nllb_eng_Latn-vie_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-vie_Latn/train-*
- config_name: nllb_eng_Latn-zho_Hans
  data_files:
  - split: train
    path: nllb_eng_Latn-zho_Hans/train-*
- config_name: nllb_eng_Latn-zho_Hant
  data_files:
  - split: train
    path: nllb_eng_Latn-zho_Hant/train-*
- config_name: nllb_eng_Latn-zsm_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-zsm_Latn/train-*
---

# pretrain-nllb-filtered

Filtered parallel corpus from [allenai/nllb](https://huggingface.co/datasets/allenai/nllb) for cross-lingual embedding pretraining.

## Schema

```json
{"query": "string", "pos": ["string", ...]}
```

- `query`: source language sentence
- `pos`: target language sentence(s)

## Configs (51 language pairs)

| Config | Parquet Files |
|--------|-------------:|
| `nllb_arb_Arab-eng_Latn` | 20 |
| `nllb_ben_Beng-eng_Latn` | 6 |
| `nllb_bul_Cyrl-eng_Latn` | 15 |
| `nllb_cat_Latn-eng_Latn` | 5 |
| `nllb_ces_Latn-eng_Latn` | 14 |
| `nllb_dan_Latn-eng_Latn` | 12 |
| `nllb_deu_Latn-eng_Latn` | 77 |
| `nllb_ell_Grek-eng_Latn` | 20 |
| `nllb_eng_Latn-est_Latn` | 7 |
| `nllb_eng_Latn-fin_Latn` | 12 |
| `nllb_eng_Latn-fra_Latn` | 121 |
| `nllb_eng_Latn-glg_Latn` | 4 |
| `nllb_eng_Latn-heb_Hebr` | 9 |
| `nllb_eng_Latn-hin_Deva` | 5 |
| `nllb_eng_Latn-hrv_Latn` | 7 |
| `nllb_eng_Latn-hun_Latn` | 13 |
| `nllb_eng_Latn-ind_Latn` | 27 |
| `nllb_eng_Latn-isl_Latn` | 3 |
| `nllb_eng_Latn-ita_Latn` | 59 |
| `nllb_eng_Latn-jpn_Jpan` | 11 |
| `nllb_eng_Latn-kaz_Cyrl` | 5 |
| `nllb_eng_Latn-khm_Khmr` | 2 |
| `nllb_eng_Latn-kor_Hang` | 6 |
| `nllb_eng_Latn-lit_Latn` | 8 |
| `nllb_eng_Latn-lvs_Latn` | 5 |
| `nllb_eng_Latn-mal_Mlym` | 9 |
| `nllb_eng_Latn-mar_Deva` | 6 |
| `nllb_eng_Latn-mkd_Cyrl` | 6 |
| `nllb_eng_Latn-mya_Mymr` | 2 |
| `nllb_eng_Latn-nld_Latn` | 37 |
| `nllb_eng_Latn-pes_Arab` | 11 |
| `nllb_eng_Latn-pol_Latn` | 27 |
| `nllb_eng_Latn-por_Latn` | 69 |
| `nllb_eng_Latn-ron_Latn` | 21 |
| `nllb_eng_Latn-rus_Cyrl` | 68 |
| `nllb_eng_Latn-slk_Latn` | 13 |
| `nllb_eng_Latn-slv_Latn` | 9 |
| `nllb_eng_Latn-spa_Latn` | 153 |
| `nllb_eng_Latn-srp_Cyrl` | 9 |
| `nllb_eng_Latn-swe_Latn` | 24 |
| `nllb_eng_Latn-swh_Latn` | 3 |
| `nllb_eng_Latn-tam_Taml` | 9 |
| `nllb_eng_Latn-tel_Telu` | 10 |
| `nllb_eng_Latn-tgl_Latn` | 8 |
| `nllb_eng_Latn-tur_Latn` | 15 |
| `nllb_eng_Latn-ukr_Cyrl` | 10 |
| `nllb_eng_Latn-urd_Arab` | 5 |
| `nllb_eng_Latn-vie_Latn` | 21 |
| `nllb_eng_Latn-zho_Hans` | 16 |
| `nllb_eng_Latn-zho_Hant` | 1 |
| `nllb_eng_Latn-zsm_Latn` | 8 |

Updated on 2026-03-22
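## Usage sketch

The card does not include loading code, so the following is a minimal sketch rather than an official recipe. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID `datalama/pretrain-nllb-filtered`. The helper names (`parse_config_name`, `iter_pairs`, `load_pair`) are hypothetical and introduced only for illustration. The card does not say which of the two codes in a config name corresponds to `query`, so the parser reports them only as `first` and `second`.

```python
from typing import Dict, Iterator, List, NamedTuple, Tuple, Union


class LanguagePair(NamedTuple):
    first: str   # first code in the config name, e.g. "arb_Arab"
    second: str  # second code in the config name, e.g. "eng_Latn"


def parse_config_name(config_name: str) -> LanguagePair:
    """Split a config such as 'nllb_arb_Arab-eng_Latn' into its two language codes."""
    prefix = "nllb_"
    if not config_name.startswith(prefix):
        raise ValueError(f"unexpected config name: {config_name!r}")
    first, sep, second = config_name[len(prefix):].partition("-")
    if not sep or not first or not second:
        raise ValueError(f"unexpected config name: {config_name!r}")
    return LanguagePair(first, second)


def iter_pairs(record: Dict[str, Union[str, List[str]]]) -> Iterator[Tuple[str, str]]:
    """Yield one (query, positive) tuple per entry in record['pos']."""
    query = record["query"]
    for positive in record["pos"]:
        yield query, positive


def load_pair(config_name: str, streaming: bool = True):
    """Load the train split of one config (assumption: needs network access)."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "datalama/pretrain-nllb-filtered",
        config_name,
        split="train",
        streaming=streaming,
    )
```

With network access, `for record in load_pair("nllb_eng_Latn-fra_Latn"): ...` would iterate over records, and `iter_pairs(record)` flattens each one into sentence pairs for contrastive training. Streaming is the default here because several configs span more than a hundred Parquet files.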
Provider:
datalama