innadark/topxgen-gemma-3-27b-and-nllb-3.3b
Source: Hugging Face. Updated 2026-01-12; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/innadark/topxgen-gemma-3-27b-and-nllb-3.3b
Resource description:
---
dataset_info:
  features:
  - name: source
    dtype: string
  - name: target
    dtype: string
  - name: source_language
    dtype: string
  - name: target_language
    dtype: string
  splits:
  - name: Basque
    num_bytes: 37470947
    num_examples: 120031
  - name: Hausa
    num_bytes: 32592247
    num_examples: 101466
  - name: Igbo
    num_bytes: 39978029
    num_examples: 133063
  - name: Kinyarwanda
    num_bytes: 18880086
    num_examples: 57884
  - name: Nepali
    num_bytes: 63640738
    num_examples: 142681
  - name: Somali
    num_bytes: 32047868
    num_examples: 96315
  - name: Sundanese
    num_bytes: 24975269
    num_examples: 78257
  - name: Swahili
    num_bytes: 28577864
    num_examples: 86981
  - name: Urdu
    num_bytes: 46227533
    num_examples: 131118
  - name: Xhosa
    num_bytes: 33036419
    num_examples: 104979
  download_size: 200034648
  dataset_size: 357427000
configs:
- config_name: default
  data_files:
  - split: Basque
    path: data/Basque-*
  - split: Hausa
    path: data/Hausa-*
  - split: Igbo
    path: data/Igbo-*
  - split: Kinyarwanda
    path: data/Kinyarwanda-*
  - split: Nepali
    path: data/Nepali-*
  - split: Somali
    path: data/Somali-*
  - split: Sundanese
    path: data/Sundanese-*
  - split: Swahili
    path: data/Swahili-*
  - split: Urdu
    path: data/Urdu-*
  - split: Xhosa
    path: data/Xhosa-*
task_categories:
- translation
language:
- eu
- ha
- ig
- rw
- ne
- so
- su
- sw
- ur
- xh
---
# TopXGen: Topic-Diverse Parallel Data for Low-Resource MT
## Dataset Summary
This is a synthetic parallel dataset covering 10 low-resource languages, created by applying the **TopXGen** pipeline with recent multilingual LLMs. It is intended for machine translation (MT) fine-tuning and for few-shot experiments, where it serves as the pool from which in-context examples are selected.
The pipeline works as follows:
1. **Topic-diverse paragraph generation** in the target low-resource language using an LLM (generator), with diversity controlled via topic selection and temperature settings.
2. **Sentence splitting** followed by translation/back-translation with an MT model (back-translator).
3. **Redundancy removal** similar to the *self-instruct* approach.
Models trained on this TopXGen dataset achieve translation performance close to that of the generator and back-translator. For more details, see our [paper](https://arxiv.org/abs/2508.08680).
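The redundancy-removal step can be sketched as a greedy ROUGE-L filter in the spirit of *self-instruct*: a sentence is kept only if it is not too similar to any sentence already kept. This is an illustration, not the authors' implementation. The function names, the whitespace tokenization, and the 0.7 threshold (the value used by self-instruct) are assumptions; the paper should be consulted for the actual settings.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def remove_redundant(sentences, threshold=0.7):
    """Greedily keep sentences whose ROUGE-L F1 to every kept sentence is below threshold."""
    kept = []
    for sentence in sentences:
        if all(rouge_l_f1(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept


sentences = [
    "the cat sat on the mat",
    "the cat sat on a mat",              # near-duplicate of the first sentence
    "stock markets fell sharply today",
]
print(remove_redundant(sentences))
```

The comparison is quadratic in the number of kept sentences, so a real pipeline would likely batch or parallelize it.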
## Supported Languages
- **Basque (eus)**
- **Hausa (hau)**
- **Igbo (ibo)**
- **Kinyarwanda (kin)**
- **Nepali (nep)**
- **Somali (som)**
- **Sundanese (sun)**
- **Swahili (swh)**
- **Urdu (urd)**
- **Xhosa (xho)**
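For scripting over all languages, the split names, the two-letter codes from the card metadata, the three-letter codes listed above, and the per-split example counts can be gathered into one lookup table. The `SPLITS` name and tuple layout are illustrative; all values are copied from this card.

```python
# split name -> (two-letter code from the metadata, three-letter code from the list above, num_examples)
SPLITS = {
    "Basque": ("eu", "eus", 120031),
    "Hausa": ("ha", "hau", 101466),
    "Igbo": ("ig", "ibo", 133063),
    "Kinyarwanda": ("rw", "kin", 57884),
    "Nepali": ("ne", "nep", 142681),
    "Somali": ("so", "som", 96315),
    "Sundanese": ("su", "sun", 78257),
    "Swahili": ("sw", "swh", 86981),
    "Urdu": ("ur", "urd", 131118),
    "Xhosa": ("xh", "xho", 104979),
}

# Total number of sentence pairs across all ten splits.
total_examples = sum(count for _, _, count in SPLITS.values())
print(total_examples)  # 1052775
```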
## Data Generation
- **Generator:** [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)
- **Back-translator:** [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
## Example Usage
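```python
from datasets import load_dataset

# Each language is its own split, named after the language (e.g. "Basque").
# Repo ID as given in the card; this copy is listed under innadark/topxgen-gemma-3-27b-and-nllb-3.3b.
dataset = load_dataset("almanach/topxgen-gemma-3-27b-and-nllb-3.3b", split="Basque")
print(dataset)
```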
Output:
```
Dataset({
    features: ['source', 'target', 'source_language', 'target_language'],
    num_rows: 120031
})
```
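Since the data is meant to serve as a selection pool for few-shot experiments, a minimal sketch of building a k-shot translation prompt from rows with the four columns above may be useful. The `build_few_shot_prompt` helper, the prompt layout, random selection, and the toy rows (including the `"English"` and `"Basque"` label values) are assumptions for illustration; inspect a real row to see what `source_language` and `target_language` actually contain.

```python
import random


def build_few_shot_prompt(pool, query, k=3, seed=0):
    """Sample k rows from a non-empty pool and format them as in-context examples."""
    rng = random.Random(seed)
    shots = rng.sample(pool, k=min(k, len(pool)))
    lines = []
    for row in shots:
        lines.append(f"{row['source_language']}: {row['source']}")
        lines.append(f"{row['target_language']}: {row['target']}")
        lines.append("")
    # The query reuses the language labels of the pool (assumed uniform within a split).
    lines.append(f"{pool[0]['source_language']}: {query}")
    lines.append(f"{pool[0]['target_language']}:")
    return "\n".join(lines)


# Hypothetical placeholder rows standing in for real dataset records.
toy_pool = [
    {
        "source": f"source sentence {i}",
        "target": f"target sentence {i}",
        "source_language": "English",
        "target_language": "Basque",
    }
    for i in range(5)
]

prompt = build_few_shot_prompt(toy_pool, "sentence to translate", k=2)
print(prompt)
```

With the real data, a pool could be obtained with something like `list(dataset.select(range(1000)))`, which yields one dictionary per row.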
## Licensing
This dataset is derived from outputs of Google’s Gemma-3 and Meta’s NLLB. Users must comply with the licenses and usage guidelines of both models.
Provider: innadark



