datalama/pretrain-nllb-filtered
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/datalama/pretrain-nllb-filtered
Resource description:
---
configs:
- config_name: nllb_arb_Arab-eng_Latn
  data_files:
  - split: train
    path: nllb_arb_Arab-eng_Latn/train-*
- config_name: nllb_ben_Beng-eng_Latn
  data_files:
  - split: train
    path: nllb_ben_Beng-eng_Latn/train-*
- config_name: nllb_bul_Cyrl-eng_Latn
  data_files:
  - split: train
    path: nllb_bul_Cyrl-eng_Latn/train-*
- config_name: nllb_cat_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_cat_Latn-eng_Latn/train-*
- config_name: nllb_ces_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_ces_Latn-eng_Latn/train-*
- config_name: nllb_dan_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_dan_Latn-eng_Latn/train-*
- config_name: nllb_deu_Latn-eng_Latn
  data_files:
  - split: train
    path: nllb_deu_Latn-eng_Latn/train-*
- config_name: nllb_ell_Grek-eng_Latn
  data_files:
  - split: train
    path: nllb_ell_Grek-eng_Latn/train-*
- config_name: nllb_eng_Latn-est_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-est_Latn/train-*
- config_name: nllb_eng_Latn-fin_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-fin_Latn/train-*
- config_name: nllb_eng_Latn-fra_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-fra_Latn/train-*
- config_name: nllb_eng_Latn-glg_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-glg_Latn/train-*
- config_name: nllb_eng_Latn-heb_Hebr
  data_files:
  - split: train
    path: nllb_eng_Latn-heb_Hebr/train-*
- config_name: nllb_eng_Latn-hin_Deva
  data_files:
  - split: train
    path: nllb_eng_Latn-hin_Deva/train-*
- config_name: nllb_eng_Latn-hrv_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-hrv_Latn/train-*
- config_name: nllb_eng_Latn-hun_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-hun_Latn/train-*
- config_name: nllb_eng_Latn-ind_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-ind_Latn/train-*
- config_name: nllb_eng_Latn-isl_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-isl_Latn/train-*
- config_name: nllb_eng_Latn-ita_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-ita_Latn/train-*
- config_name: nllb_eng_Latn-jpn_Jpan
  data_files:
  - split: train
    path: nllb_eng_Latn-jpn_Jpan/train-*
- config_name: nllb_eng_Latn-kaz_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-kaz_Cyrl/train-*
- config_name: nllb_eng_Latn-khm_Khmr
  data_files:
  - split: train
    path: nllb_eng_Latn-khm_Khmr/train-*
- config_name: nllb_eng_Latn-kor_Hang
  data_files:
  - split: train
    path: nllb_eng_Latn-kor_Hang/train-*
- config_name: nllb_eng_Latn-lit_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-lit_Latn/train-*
- config_name: nllb_eng_Latn-lvs_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-lvs_Latn/train-*
- config_name: nllb_eng_Latn-mal_Mlym
  data_files:
  - split: train
    path: nllb_eng_Latn-mal_Mlym/train-*
- config_name: nllb_eng_Latn-mar_Deva
  data_files:
  - split: train
    path: nllb_eng_Latn-mar_Deva/train-*
- config_name: nllb_eng_Latn-mkd_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-mkd_Cyrl/train-*
- config_name: nllb_eng_Latn-mya_Mymr
  data_files:
  - split: train
    path: nllb_eng_Latn-mya_Mymr/train-*
- config_name: nllb_eng_Latn-nld_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-nld_Latn/train-*
- config_name: nllb_eng_Latn-pes_Arab
  data_files:
  - split: train
    path: nllb_eng_Latn-pes_Arab/train-*
- config_name: nllb_eng_Latn-pol_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-pol_Latn/train-*
- config_name: nllb_eng_Latn-por_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-por_Latn/train-*
- config_name: nllb_eng_Latn-ron_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-ron_Latn/train-*
- config_name: nllb_eng_Latn-rus_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-rus_Cyrl/train-*
- config_name: nllb_eng_Latn-slk_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-slk_Latn/train-*
- config_name: nllb_eng_Latn-slv_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-slv_Latn/train-*
- config_name: nllb_eng_Latn-spa_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-spa_Latn/train-*
- config_name: nllb_eng_Latn-srp_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-srp_Cyrl/train-*
- config_name: nllb_eng_Latn-swe_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-swe_Latn/train-*
- config_name: nllb_eng_Latn-swh_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-swh_Latn/train-*
- config_name: nllb_eng_Latn-tam_Taml
  data_files:
  - split: train
    path: nllb_eng_Latn-tam_Taml/train-*
- config_name: nllb_eng_Latn-tel_Telu
  data_files:
  - split: train
    path: nllb_eng_Latn-tel_Telu/train-*
- config_name: nllb_eng_Latn-tgl_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-tgl_Latn/train-*
- config_name: nllb_eng_Latn-tur_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-tur_Latn/train-*
- config_name: nllb_eng_Latn-ukr_Cyrl
  data_files:
  - split: train
    path: nllb_eng_Latn-ukr_Cyrl/train-*
- config_name: nllb_eng_Latn-urd_Arab
  data_files:
  - split: train
    path: nllb_eng_Latn-urd_Arab/train-*
- config_name: nllb_eng_Latn-vie_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-vie_Latn/train-*
- config_name: nllb_eng_Latn-zho_Hans
  data_files:
  - split: train
    path: nllb_eng_Latn-zho_Hans/train-*
- config_name: nllb_eng_Latn-zho_Hant
  data_files:
  - split: train
    path: nllb_eng_Latn-zho_Hant/train-*
- config_name: nllb_eng_Latn-zsm_Latn
  data_files:
  - split: train
    path: nllb_eng_Latn-zsm_Latn/train-*
---
# pretrain-nllb-filtered
Filtered parallel corpus from [allenai/nllb](https://huggingface.co/datasets/allenai/nllb) for cross-lingual embedding pretraining.
## Schema
```json
{"query": "string", "pos": ["string", ...]}
```
- `query`: a source-language sentence
- `pos`: one or more target-language sentences paired with `query`
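
Because `pos` is a list, a single record can carry more than one positive for the same `query`. A minimal sketch of expanding records into flat `(query, positive)` pairs, as a contrastive training loop might consume them, is shown here. The helper name `iter_pairs` and the sample rows are hypothetical and invented for illustration; they are not taken from the dataset.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_pairs(records: Iterable[Dict]) -> Iterator[Tuple[str, str]]:
    """Yield one (query, positive) pair per entry in each record's `pos` list."""
    for record in records:
        query = record["query"]
        for positive in record["pos"]:
            yield query, positive


# Hypothetical rows following the schema above (not real dataset content).
sample: List[Dict] = [
    {"query": "Hello, world.", "pos": ["Bonjour, le monde."]},
    {"query": "Thank you.", "pos": ["Merci.", "Merci beaucoup."]},
]

pairs = list(iter_pairs(sample))
for query, positive in pairs:
    print(f"{query!r} -> {positive!r}")
```

A record with an empty `pos` list yields no pairs under this sketch, which may or may not be the behavior a given training setup wants.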
## Configs (51 language pairs)
| Config | Parquet Files |
|--------|-------------:|
| `nllb_arb_Arab-eng_Latn` | 20 |
| `nllb_ben_Beng-eng_Latn` | 6 |
| `nllb_bul_Cyrl-eng_Latn` | 15 |
| `nllb_cat_Latn-eng_Latn` | 5 |
| `nllb_ces_Latn-eng_Latn` | 14 |
| `nllb_dan_Latn-eng_Latn` | 12 |
| `nllb_deu_Latn-eng_Latn` | 77 |
| `nllb_ell_Grek-eng_Latn` | 20 |
| `nllb_eng_Latn-est_Latn` | 7 |
| `nllb_eng_Latn-fin_Latn` | 12 |
| `nllb_eng_Latn-fra_Latn` | 121 |
| `nllb_eng_Latn-glg_Latn` | 4 |
| `nllb_eng_Latn-heb_Hebr` | 9 |
| `nllb_eng_Latn-hin_Deva` | 5 |
| `nllb_eng_Latn-hrv_Latn` | 7 |
| `nllb_eng_Latn-hun_Latn` | 13 |
| `nllb_eng_Latn-ind_Latn` | 27 |
| `nllb_eng_Latn-isl_Latn` | 3 |
| `nllb_eng_Latn-ita_Latn` | 59 |
| `nllb_eng_Latn-jpn_Jpan` | 11 |
| `nllb_eng_Latn-kaz_Cyrl` | 5 |
| `nllb_eng_Latn-khm_Khmr` | 2 |
| `nllb_eng_Latn-kor_Hang` | 6 |
| `nllb_eng_Latn-lit_Latn` | 8 |
| `nllb_eng_Latn-lvs_Latn` | 5 |
| `nllb_eng_Latn-mal_Mlym` | 9 |
| `nllb_eng_Latn-mar_Deva` | 6 |
| `nllb_eng_Latn-mkd_Cyrl` | 6 |
| `nllb_eng_Latn-mya_Mymr` | 2 |
| `nllb_eng_Latn-nld_Latn` | 37 |
| `nllb_eng_Latn-pes_Arab` | 11 |
| `nllb_eng_Latn-pol_Latn` | 27 |
| `nllb_eng_Latn-por_Latn` | 69 |
| `nllb_eng_Latn-ron_Latn` | 21 |
| `nllb_eng_Latn-rus_Cyrl` | 68 |
| `nllb_eng_Latn-slk_Latn` | 13 |
| `nllb_eng_Latn-slv_Latn` | 9 |
| `nllb_eng_Latn-spa_Latn` | 153 |
| `nllb_eng_Latn-srp_Cyrl` | 9 |
| `nllb_eng_Latn-swe_Latn` | 24 |
| `nllb_eng_Latn-swh_Latn` | 3 |
| `nllb_eng_Latn-tam_Taml` | 9 |
| `nllb_eng_Latn-tel_Telu` | 10 |
| `nllb_eng_Latn-tgl_Latn` | 8 |
| `nllb_eng_Latn-tur_Latn` | 15 |
| `nllb_eng_Latn-ukr_Cyrl` | 10 |
| `nllb_eng_Latn-urd_Arab` | 5 |
| `nllb_eng_Latn-vie_Latn` | 21 |
| `nllb_eng_Latn-zho_Hans` | 16 |
| `nllb_eng_Latn-zho_Hant` | 1 |
| `nllb_eng_Latn-zsm_Latn` | 8 |
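
Every config name in the table follows the pattern `nllb_<code>_<Script>-<code>_<Script>`. The sketch below splits a name into its two language tags and shows one way a single config might be loaded with the Hugging Face `datasets` library. The helper names `parse_config_name` and `load_pair` are hypothetical, and the card does not say which side of the name maps to `query`, so the two halves are simply called `first` and `second`.

```python
from typing import NamedTuple


class LangPair(NamedTuple):
    first: str   # e.g. "eng_Latn"
    second: str  # e.g. "fra_Latn"


def parse_config_name(config_name: str) -> LangPair:
    """Split a name like 'nllb_eng_Latn-fra_Latn' into its two language tags."""
    prefix = "nllb_"
    if not config_name.startswith(prefix):
        raise ValueError(f"unexpected config name: {config_name!r}")
    first, sep, second = config_name[len(prefix):].partition("-")
    if not sep or not first or not second:
        raise ValueError(f"unexpected config name: {config_name!r}")
    return LangPair(first, second)


def load_pair(config_name: str, streaming: bool = True):
    """Load the `train` split of one config (needs `datasets` and network access)."""
    # Imported lazily so the parsing helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "datalama/pretrain-nllb-filtered",
        config_name,
        split="train",
        streaming=streaming,  # assumption: streaming avoids downloading every shard
    )


print(parse_config_name("nllb_eng_Latn-fra_Latn"))
```

Calling `load_pair("nllb_eng_Latn-fra_Latn")` would then return an iterable over records with the `query` and `pos` fields. Streaming is used as the default here on the assumption that the larger configs, which span over a hundred Parquet files, are inconvenient to download in full.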
Updated on 2026-03-22
Provider:
datalama



