
ljvmiranda921/PolyglotTeachers-SFT-Synth

Source: Hugging Face. Updated 2026-04-14. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ljvmiranda921/PolyglotTeachers-SFT-Synth
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: strategy
    dtype: string
  - name: source_id
    dtype: string
  - name: synth_prompt
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 2326387825
    num_examples: 356471
  download_size: 1083096690
  dataset_size: 2326387825
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- ar
- de
- id
- ja
- es
- cs
- tl
license: apache-2.0
task_categories:
- text-generation
tags:
- multilingual
- synthetic
- sft
pretty_name: PolyglotTeachers-SFT (Synthetic)
---

<img alt="Logo for LTL" src="ltl_logo2.svg" width="240px" style="margin-left:'auto' margin-right:'auto' display:'block'">

# PolyglotTeachers-SFT-Synth

This dataset contains synthetic supervised fine-tuning examples generated by the best teacher we found in the paper [Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation](https://arxiv.org/abs/2604.11290), where we systematically characterize what makes a good teacher model.
It contains examples across seven languages: Arabic, Czech, German, Indonesian, Japanese, Spanish, and Tagalog.

**Note:** In our experiments, we subsampled 10k examples per language for training. Here we release the full unfiltered set to enable reproducibility and give researchers the flexibility to construct their own subsamples or training mixtures.

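The per-language subsampling mentioned in the note could be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the helper name `subsample_indices`, the uniform random sampling, and the seed value are all assumptions made here for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List


def subsample_indices(
    languages: Iterable[str], per_language: int = 10_000, seed: int = 0
) -> List[int]:
    """Return sorted row indices with at most `per_language` rows per language.

    Hypothetical helper: uniform sampling with a fixed seed is an assumption,
    not the confirmed sampling method from the paper.
    """
    # Group row indices by their language code.
    by_language: Dict[str, List[int]] = defaultdict(list)
    for index, language in enumerate(languages):
        by_language[language].append(index)

    rng = random.Random(seed)
    selected: List[int] = []
    # Iterate in sorted order so the result is reproducible for a given seed.
    for language in sorted(by_language):
        indices = by_language[language]
        k = min(per_language, len(indices))
        selected.extend(rng.sample(indices, k))
    return sorted(selected)
```

With the dataset loaded as `ds` (see the Usage section), something like `ds.select(subsample_indices(ds["language"]))` would then give a subset with up to 10k rows per language.
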
## Dataset Summary

- **Languages:** Arabic (ar), Czech (cs), German (de), Indonesian (id), Japanese (ja), Spanish (es), Tagalog (tl)
- **Total examples:** 356,471
- **Teacher model:** [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)
- **Generation strategies:** generate, respond, translate

### Language Distribution

| Language | Examples |
|----------|----------|
| Indonesian (id) | 85,952 |
| German (de) | 83,878 |
| Arabic (ar) | 77,770 |
| Tagalog (tl) | 40,875 |
| Japanese (ja) | 27,198 |
| Spanish (es) | 25,609 |
| Czech (cs) | 15,189 |

## Data Sources

The seed data comes from several multilingual datasets, which were then used to synthesize new examples via Gemma-3-27B-IT.
Each source dataset was processed using one of three strategies: **generate** (create new prompt-response pairs from a seed), **respond** (generate a response given a prompt), or **translate** (translate an English example into a target language).

* [allenai/WildChat-4.8M](https://huggingface.co/datasets/allenai/WildChat-4.8M): multilingual prompt-response pairs from real user interactions.
* [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k): math word problems (English, translated into target languages).
* [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered): general chat data (English, translated into target languages).
* [nvidia/Helpsteer3](https://huggingface.co/datasets/nvidia/Helpsteer3): multilingual preference data.
* [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2): multilingual assistant conversations.
* [utter-project/EuroBlocks-SFT-Synthetic-1124](https://huggingface.co/datasets/utter-project/EuroBlocks-SFT-Synthetic-1124): European multilingual synthetic data.
* [CohereLabs/aya_collection](https://huggingface.co/datasets/CohereLabs/aya_collection): multilingual instruction data.
* [arbml/CIDAR](https://huggingface.co/datasets/arbml/CIDAR): Arabic instruction data.
* [indonlp/cendol_collection_v2](https://huggingface.co/datasets/indonlp/cendol_collection_v2): Indonesian instruction data.

## Dataset Structure

Each example contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `id` | str | Unique identifier |
| `source` | str | Source dataset name |
| `language` | str | ISO 639-1 language code |
| `strategy` | str | Synthesis strategy used (`generate`, `respond`, or `translate`) |
| `source_id` | str | Identifier from the source dataset |
| `synth_prompt` | str | The prompt used to instruct the teacher model during synthesis |
| `model` | str | Teacher model used for generation |
| `prompt` | str | The user prompt |
| `response` | str | The model response |
| `messages` | list | Chat-formatted messages (`role` and `content`) for SFT |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("ljvmiranda921/PolyglotTeachers-SFT-Synth", split="train")

# Filter by language
arabic_ds = ds.filter(lambda x: x["language"] == "ar")

# Use the messages field directly for SFT
print(arabic_ds[0]["messages"])
```

## Acknowledgements

LJVM and AK acknowledge the support of the UKRI Frontier Grant EP/Y031350/1 ([EQUATE](https://gtr.ukri.org/projects?ref=EP%2FY031350%2F1)).
This work was performed using joint resources provided by the [Cambridge Service for Data Driven Discovery (CSD3)](https://hpc.cam.ac.uk/high-performance-computing) EP/T022159/1 and the [Isambard AI National AI Research Resource (AIRR)](https://www.bristol.ac.uk/research/centres/bristol-supercomputing/#isambard-ai) ST/AIRR/I-A-I/1023, and the Microsoft Research Grant.
LJVM would also like to thank Songbo Hu, Chen Cecilia Liu, Millicent Ochieng, and Felermino Ali for helpful and productive discussions on the project.

## Citation

```bibtex
@misc{miranda2026polyglotteachersevaluatinglanguage,
  title={Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation},
  author={Lester James V. Miranda and Ivan Vulić and Anna Korhonen},
  year={2026},
  eprint={2604.11290},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.11290},
}
```

Provider:
ljvmiranda921