
luth-sft

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/kurakurai/luth-sft
Resource description:
![Kurakura AI Logo](media/logo_kurakura.png)

---

## Dataset Details

This dataset includes all the data used to fine-tune [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct), enhancing their French capabilities on tasks such as instruction following, mathematics, and general knowledge. The models also improved in English thanks to knowledge transfer between the two languages.

It contains ~338M tokens in French. Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).

## Dataset Sources

### Scholar

By **Kurakura AI**: [Dataset Link](https://huggingface.co/datasets/kurakurai/scholar). Built from scraped subjects of French Baccalauréat and Preparatory Class (CPGE) entrance exams in mathematics, computer science, and physics.

### Tulu 3 Persona Instruct

By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following). Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.

### Tulu 3 Persona Math

By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math). Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.

### Smoltalk2

By **HuggingFaceTB**: [Dataset Link](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2). Extracted French samples only.

### Aya Dataset

By **CohereLabs**: [Dataset Link](https://huggingface.co/datasets/CohereLabs/aya_dataset). Extracted French samples only.

### OpenHermes-fr

By **legmlai**: [Dataset Link](https://huggingface.co/datasets/legmlai/openhermes-fr). Filtered the dataset.

### CroissantLLM

By **Manuel Faysse**: [Dataset Link](https://huggingface.co/datasets/croissantllm/CroissantLLM-2201-sft). Extracted French samples only.
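
Several of the sources above are described as "Extracted French samples only." The sketch below illustrates one way such a filter could look. It is not the authors' pipeline (their actual scripts are in the GitHub repository linked above). The `language` field name, the `FRENCH_TAGS` set, and the `keep_french` helper are all hypothetical, and real sources may tag language differently or not at all.

```python
from typing import Iterable, Iterator, Mapping

# Assumption: each record carries a language tag; these spellings are
# illustrative and not taken from any specific source dataset.
FRENCH_TAGS = {"fr", "fra", "french"}


def keep_french(
    records: Iterable[Mapping[str, object]],
    lang_key: str = "language",  # hypothetical field name
) -> Iterator[Mapping[str, object]]:
    """Yield only the records whose language tag marks them as French."""
    for record in records:
        # Normalize the tag; records without a tag are dropped.
        tag = str(record.get(lang_key, "")).strip().lower()
        if tag in FRENCH_TAGS:
            yield record


# Toy records standing in for rows of a multilingual SFT dataset.
samples = [
    {"language": "French", "text": "Bonjour, comment ça va ?"},
    {"language": "English", "text": "Hello, how are you?"},
    {"language": "fra", "text": "Résous l'équation x + 2 = 5."},
    {"text": "no language tag"},
]

french_only = list(keep_french(samples))
```

Dropping untagged records is a conservative choice made for this sketch; a pipeline that also needs untagged French text would have to add a language-identification step.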

## Citation

```bibtex
@misc{luth2025kurakurai,
  title        = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
  author       = {Lasbordes, Maxence and Gad, Sinoué},
  year         = {2025},
  howpublished = {\url{https://arxiv.org/abs/2510.05846}},
  note         = {arXiv:2510.05846}
}
```
Provider: maas
Created: 2025-08-28