innadark/topxgen-gemma-3-27b-and-nllb-3.3b

Name: innadark/topxgen-gemma-3-27b-and-nllb-3.3b
Creator: innadark
Published: 2026-01-12 19:36:38
License: No description available

Source: Hugging Face. Updated: 2026-01-12. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/innadark/topxgen-gemma-3-27b-and-nllb-3.3b

Resource description:

---
dataset_info:
  features:
  - name: source
    dtype: string
  - name: target
    dtype: string
  - name: source_language
    dtype: string
  - name: target_language
    dtype: string
  splits:
  - name: Basque
    num_bytes: 37470947
    num_examples: 120031
  - name: Hausa
    num_bytes: 32592247
    num_examples: 101466
  - name: Igbo
    num_bytes: 39978029
    num_examples: 133063
  - name: Kinyarwanda
    num_bytes: 18880086
    num_examples: 57884
  - name: Nepali
    num_bytes: 63640738
    num_examples: 142681
  - name: Somali
    num_bytes: 32047868
    num_examples: 96315
  - name: Sundanese
    num_bytes: 24975269
    num_examples: 78257
  - name: Swahili
    num_bytes: 28577864
    num_examples: 86981
  - name: Urdu
    num_bytes: 46227533
    num_examples: 131118
  - name: Xhosa
    num_bytes: 33036419
    num_examples: 104979
  download_size: 200034648
  dataset_size: 357427000
configs:
- config_name: default
  data_files:
  - split: Basque
    path: data/Basque-*
  - split: Hausa
    path: data/Hausa-*
  - split: Igbo
    path: data/Igbo-*
  - split: Kinyarwanda
    path: data/Kinyarwanda-*
  - split: Nepali
    path: data/Nepali-*
  - split: Somali
    path: data/Somali-*
  - split: Sundanese
    path: data/Sundanese-*
  - split: Swahili
    path: data/Swahili-*
  - split: Urdu
    path: data/Urdu-*
  - split: Xhosa
    path: data/Xhosa-*
task_categories:
- translation
language:
- eu
- ha
- ig
- rw
- ne
- so
- su
- sw
- ur
- xh
---

# TopXGen: Topic-Diverse Parallel Data for Low-Resource MT

## Dataset Summary

This is a synthetic parallel dataset for 10 low-resource languages, created by applying the **TopXGen** pipeline with recent multilingual LLMs. It is designed for machine translation (MT) fine-tuning and for few-shot experiments, where it serves as a selection pool.

The pipeline works as follows:

1. **Topic-diverse paragraph generation** in the target low-resource language using an LLM (generator), with diversity controlled via topic selection and temperature settings.
2. **Sentence splitting** followed by translation/back-translation with an MT model (back-translator).
3. **Redundancy removal** similar to the *self-instruct* approach.

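
The redundancy-removal step can be illustrated with a small sketch. The card only says the step is "similar to the *self-instruct* approach"; Self-Instruct keeps a new item only if its ROUGE-L similarity to every already-kept item is below 0.7, so the code below assumes that rule. The whitespace tokenization, the 0.7 threshold, and the names `lcs_length`, `rouge_l_f1`, and `remove_redundant` are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterable, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (an assumed tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def remove_redundant(sentences: Iterable[str], threshold: float = 0.7) -> List[str]:
    """Greedily keep a sentence only if it is dissimilar to all kept ones.

    The 0.7 default follows Self-Instruct; TopXGen's actual value is not
    stated in this card.
    """
    kept: List[str] = []
    for sentence in sentences:
        if all(rouge_l_f1(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept


# Toy English input, used only to show the filter's behavior.
sample = [
    "The river floods the valley every spring.",
    "The river floods the valley every spring season.",
    "Farmers sell maize at the weekly market.",
]
filtered = remove_redundant(sample)
print(filtered)
```

The greedy pass compares each candidate with everything kept so far, so its cost grows quadratically with the number of sentences; a pipeline at the scale of this dataset would likely need batching or an approximate similarity index.
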
Models trained on this TopXGen dataset achieve translation performance close to that of the generator and back-translator. For more details, see our [paper](https://arxiv.org/abs/2508.08680).

## Supported Languages

- **Basque (eus)**
- **Hausa (hau)**
- **Igbo (ibo)**
- **Kinyarwanda (kin)**
- **Nepali (nep)**
- **Somali (som)**
- **Sundanese (sun)**
- **Swahili (swh)**
- **Urdu (urd)**
- **Xhosa (xho)**

## Data Generation

- **Generator:** [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)
- **Back-translator:** [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)

## Example Usage

```python
from datasets import load_dataset

dataset = load_dataset("almanach/topxgen-gemma-3-27b-and-nllb-3.3b", split="Basque")
print(dataset)
```

Output

```
Dataset({
    features: ['source', 'target', 'source_language', 'target_language'],
    num_rows: 120031
})
```

## Licensing

This dataset is derived from outputs of Google’s Gemma-3 and Meta’s NLLB. Users must comply with the licenses and usage guidelines of both models.
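
Returning to the few-shot use case mentioned in the summary, here is a minimal sketch of treating a split as a selection pool. It relies only on the four fields listed in the card (`source`, `target`, `source_language`, `target_language`). The helper names, the uniform random selection, the prompt layout, and the placeholder rows (including the `"English"` and `"Basque"` label values) are assumptions for illustration; the card does not show real field values or prescribe a selection strategy.

```python
import random
from typing import Dict, List, Sequence

# Repository id as written in the card's usage example.
REPO_ID = "almanach/topxgen-gemma-3-27b-and-nllb-3.3b"


def load_split(split: str):
    """Load one language split; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset(REPO_ID, split=split)


def sample_demonstrations(
    rows: Sequence[Dict[str, str]], k: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Pick up to k rows uniformly at random (one simple selection strategy)."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(rows)), k=min(k, len(rows)))
    return [rows[i] for i in indices]


def build_prompt(
    demos: Sequence[Dict[str, str]],
    query: str,
    source_language: str,
    target_language: str,
) -> str:
    """Format demonstrations plus a new source sentence as a few-shot prompt."""
    lines: List[str] = []
    for demo in demos:
        lines.append(f"{demo['source_language']}: {demo['source']}")
        lines.append(f"{demo['target_language']}: {demo['target']}")
        lines.append("")
    lines.append(f"{source_language}: {query}")
    lines.append(f"{target_language}:")
    return "\n".join(lines)


# Hypothetical placeholder rows standing in for a loaded split.
toy_rows = [
    {
        "source": f"<source sentence {i}>",
        "target": f"<target sentence {i}>",
        "source_language": "English",
        "target_language": "Basque",
    }
    for i in range(5)
]
demos = sample_demonstrations(toy_rows, k=2)
prompt = build_prompt(demos, "<new source sentence>", "English", "Basque")
print(prompt)
```

With network access, `load_split("Basque")` can be passed to `sample_demonstrations` in place of `toy_rows`, since a loaded split supports `len()` and integer indexing that returns one row as a dictionary.
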

Provider:

innadark
