ljvmiranda921/PolyglotTeachers-SFT-Synth
收藏Hugging Face2026-04-14 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/ljvmiranda921/PolyglotTeachers-SFT-Synth
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: id
dtype: string
- name: source
dtype: string
- name: language
dtype: string
- name: strategy
dtype: string
- name: source_id
dtype: string
- name: synth_prompt
dtype: string
- name: model
dtype: string
- name: prompt
dtype: string
- name: response
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: train
num_bytes: 2326387825
num_examples: 356471
download_size: 1083096690
dataset_size: 2326387825
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
language:
- ar
- de
- id
- ja
- es
- cs
- tl
license: apache-2.0
task_categories:
- text-generation
tags:
- multilingual
- synthetic
- sft
pretty_name: PolyglotTeachers-SFT (Synthetic)
---
<img alt="Logo for LTL" src="ltl_logo2.svg" width="240px" style="margin-left:'auto' margin-right:'auto' display:'block'">
# PolyglotTeachers-SFT-Synth
This dataset contains synthetic supervised fine-tuning examples generated by the best teacher we found in the paper [Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation](), where we systematically characterize what makes a good teacher model.
It contains examples across six languages: Arabic, Czech, German, Indonesian, Japanese, Spanish, and Tagalog. **Note:** In our experiments, we subsampled 10k examples per language for training. Here we release the full unfiltered set to enable reproducibility and give researchers the flexibility to construct their own subsamples or training mixtures.
## Dataset Summary
- **Languages:** Arabic (ar), Czech (cs), German (de), Indonesian (id), Japanese (ja), Spanish (es), Tagalog (tl)
- **Total examples:** 315,596
- **Teacher model:** [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)
- **Generation strategies:** generate, respond, translate
### Language Distribution
| Language | Examples |
|----------|----------|
| Indonesian (id) | 85,952 |
| German (de) | 83,878 |
| Arabic (ar) | 77,770 |
| Japanese (ja) | 27,198 |
| Tagalog (tl) | 40,875 |
| Spanish (es) | 25,609 |
| Czech (cs) | 15,189 |
## Data Sources
The seed data comes from several multilingual datasets, which were then used to synthesize new examples via Gemma-3-27B-IT.
Each source dataset was processed using one of three strategies: **generate** (create new prompt-response pairs from a seed), **respond** (generate a response given a prompt), or **translate** (translate an English example into a target language).
* [allenai/WildChat-4.8M](https://huggingface.co/datasets/allenai/WildChat-4.8M): multilingual prompt-response pairs from real user interactions.
* [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k): math word problems (English, translated into target languages).
* [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered): general chat data (English, translated into target languages).
* [nvidia/Helpsteer3](https://huggingface.co/datasets/nvidia/Helpsteer3): multilingual preference data.
* [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2): multilingual assistant conversations.
* [utter-project/EuroBlocks-SFT-Synthetic-1124](https://huggingface.co/datasets/utter-project/EuroBlocks-SFT-Synthetic-1124): European multilingual synthetic data.
* [CohereLabs/aya_collection](https://huggingface.co/datasets/CohereLabs/aya_collection): multilingual instruction data.
* [arbml/CIDAR](https://huggingface.co/datasets/arbml/CIDAR): Arabic instruction data.
* [indonlp/cendol_collection_v2](https://huggingface.co/datasets/indonlp/cendol_collection_v2): Indonesian instruction data.
## Dataset Structure
Each example contains the following fields:
| Field | Type | Description |
|-------|------|-------------|
| `id` | str | Unique identifier |
| `source` | str | Source dataset name |
| `language` | str | ISO 639-1 language code |
| `strategy` | str | Synthesis strategy used (`generate`, `respond`, or `translate`) |
| `source_id` | str | Identifier from the source dataset |
| `synth_prompt` | str | The prompt used to instruct the teacher model during synthesis |
| `model` | str | Teacher model used for generation |
| `prompt` | str | The user prompt |
| `response` | str | The model response |
| `messages` | list | Chat-formatted messages (`role` and `content`) for SFT |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("ljvmiranda921/PolyglotTeachers-SFT-Synth", split="train")
# Filter by language
arabic_ds = ds.filter(lambda x: x["language"] == "ar")
# Use the messages field directly for SFT
print(arabic_ds[0]["messages"])
```
## Acknowledgements
LJVM and AK acknowledge the support of the UKRI Frontier Grant EP/Y031350/1 ([EQUATE](https://gtr.ukri.org/projects?ref=EP%2FY031350%2F1)).
This work was performed using joint resources provided by the [Cambridge Service for Data Driven Discovery (CSD3)](https://hpc.cam.ac.uk/high-performance-computing) EP/T022159/1 and the [Isambard AI National AI Research Resource (AIRR)](https://www.bristol.ac.uk/research/centres/bristol-supercomputing/#isambard-ai) ST/AIRR/I-A-I/1023, and the Microsoft Research Grant.
LJVM would also like to thank Songbo Hu, Chen Cecilia Liu, Millicent Ochieng, and Felermino Ali for helpful and productive discussions on the project.
## Citation
```bibtex
@misc{miranda2026polyglotteachersevaluatinglanguage,
title={Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation},
author={Lester James V. Miranda and Ivan Vulić and Anna Korhonen},
year={2026},
eprint={2604.11290},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.11290},
}
```
提供机构:
ljvmiranda921



