
Kreyol/kakugo-hat

Source: Hugging Face. Updated 2026-04-03. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Kreyol/kakugo-hat
Description:
---
license: apache-2.0
language:
- hat
task_categories:
- text-generation
tags:
- low-resource-language
- data-distillation
- conversation
- hat
- Haitian Creole
---

# Kakugo Haitian Creole dataset

[[Paper]](https://arxiv.org/abs/2601.14051) [[Code]](https://github.com/Peter-Devine/kakugo) [[Model]](https://huggingface.co/ptrdvn/kakugo-3B-hat)

<p align="center">
  A synthetically generated conversation dataset for training in Haitian Creole.
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/hmRaNkmPAV8rakBOhtgZI.png" alt="Globe Image" width="400"/>
</p>

This dataset contains synthetic conversational data and translated instructions designed to train Small Language Models (SLMs) for **Haitian Creole**.

It was generated using the **Kakugo** pipeline, a method for distilling high-quality capabilities from a large teacher model into low-resource language models. The teacher model used to generate this dataset was [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b).

For Kakugo in other languages, check out the [model](https://huggingface.co/collections/ptrdvn/kakugo-models) and [dataset](https://huggingface.co/collections/ptrdvn/kakugo-datasets) collections.

## Creation Methodology

This dataset was created using the automated Kakugo pipeline described in [our paper](https://arxiv.org/abs/2601.14051). Full details of how this dataset was created (and how you can make a dataset in your own chosen language) can be found on our [Github repo](https://github.com/Peter-Devine/kakugo).

### 1. Synthetic Data Generation

We prompted a teacher model (**GPT-OSS 120B**) to generate diverse prompts in Haitian Creole using three strategies:

* **Topic-Based:** Prompts derived from a tree of general and language-specific topics (e.g., local culture, history, daily life).
* **Scenario-Based:** Prompts based on realistic user scenarios where an AI assistant would be useful (e.g., "planning a trip," "explaining a concept").
* **Context-Based:** Prompts generated by feeding the teacher model random text snippets from [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) in Haitian Creole and asking it to perform tasks like summarization, translation, or QA based on that text.

For every generated prompt, the teacher model produced a response. Crucially, we captured the teacher's **reasoning traces** (chain-of-thought) to help the student model learn *how* to think, not just what to say.

### 2. Instruction Translation

To bolster general instruction-following capabilities, we sampled high-quality English instructions from the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) (7M_core) dataset.

* These were translated into Haitian Creole using the teacher model.
* Strict filtering was applied: conversations were discarded if the translated length was disproportionate (indicating hallucination or failure) or if the formatting was broken.

## Usage & Limitations

* **Thinking Mode:** The dataset includes specific system prompts that trigger "thinking mode." Thinking mode is trained for the data we have reasoning traces for - only our synthetically generated data. When training on this data, the model learns to output `<think>` tags containing reasoning steps only when prompted.
* **Synthetic Nature:** While the teacher model is highly capable, this data is synthetic or machine-translated. This dataset is NOT PERFECT!

# Credit

This model was trained by [@ptrdvn](https://huggingface.co/ptrdvn)

If you use this dataset, please cite the Kakugo paper:

```bibtex
@article{devine2026kakugo,
  title={Kakugo: Distillation of Low-Resource Languages into Small Language Models},
  author={Devine, Peter and Sanni, Mardhiyah and Adilazuarda, Farid and Loizaga, Julieta Gil and Haddow, Barry},
  journal={arXiv preprint arXiv:2601.14051},
  year={2026}
}
```
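
As a rough illustration of two ideas from the card, the length filter applied during instruction translation and the `<think>` tags used in thinking mode, here is a minimal Python sketch. It is not the Kakugo pipeline's code. The ratio bounds, the function names, and the assumption that reasoning is wrapped in a single `<think>...</think>` pair are all hypothetical, since the card does not state them.

```python
import re

# Hypothetical bounds: the card only says "disproportionate" lengths were
# discarded and does not give the actual thresholds.
MIN_RATIO = 0.5
MAX_RATIO = 2.0


def length_ratio_ok(source: str, translation: str,
                    min_ratio: float = MIN_RATIO,
                    max_ratio: float = MAX_RATIO) -> bool:
    """Return True when the translation length is proportionate to the source."""
    if not source or not translation:
        # An empty side is treated as a failed translation.
        return False
    ratio = len(translation) / len(source)
    return min_ratio <= ratio <= max_ratio


# Assumed format: reasoning wrapped in one <think>...</think> pair.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer); reasoning is '' if absent."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched tag pair is treated as the answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer
```

A character-length ratio is only one possible proxy for "disproportionate"; a token-based ratio or a language-specific expansion factor may suit Haitian Creole better, so treat the bounds as placeholders to tune against real data.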
Provider:
Kreyol