luth-sft

Name: luth-sft
Creator: maas
Published: 2025-12-05 16:48:29
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/kurakurai/luth-sft


Resource description:

![Kurakura AI Logo](media/logo_kurakura.png)

---

## Dataset Details

This dataset includes all the data used to fine-tune [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct), enhancing their French capabilities on tasks such as instruction following, mathematics, and general knowledge. The models also improved in English thanks to knowledge transfer between the two languages.

The dataset contains ~338M tokens in French. Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).

## Dataset Sources

### Scholar

By **Kurakura AI**: [Dataset Link](https://huggingface.co/datasets/kurakurai/scholar). Built from scraped exam papers from the French Baccalauréat and Preparatory Class (CPGE) entrance exams in mathematics, computer science, and physics.

### Tulu 3 Persona Instruct

By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following). Prompts were translated to French, new answers were generated with **Qwen3-32B**, and the result was then filtered.

### Tulu 3 Persona Math

By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math). Prompts were translated to French, new answers were generated with **Qwen3-32B**, and the result was then filtered.

### Smoltalk2

By **HuggingFaceTB**: [Dataset Link](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2). Only the French samples were extracted.

### Aya Dataset

By **CohereLabs**: [Dataset Link](https://huggingface.co/datasets/CohereLabs/aya_dataset). Only the French samples were extracted.

### OpenHermes-fr

By **legmlai**: [Dataset Link](https://huggingface.co/datasets/legmlai/openhermes-fr). The dataset was filtered.

### CroissantLLM

By **Manuel Faysse**: [Dataset Link](https://huggingface.co/datasets/croissantllm/CroissantLLM-2201-sft). Only the French samples were extracted.

## Citation

```bibtex
@misc{luth2025kurakurai,
  title        = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
  author       = {Lasbordes, Maxence and Gad, Sinoué},
  year         = {2025},
  howpublished = {\url{https://arxiv.org/abs/2510.05846}},
  note         = {arXiv:2510.05846}
}
```
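Several of the sources above are described as "extracted French samples only". The exact extraction code lives in the authors' GitHub repository and is not reproduced here; the following is only a minimal sketch of what such a filter might look like. The field name `language`, the accepted labels, and the toy records are all assumptions made for illustration, not the schema of any particular source dataset.

```python
from typing import Dict, Iterable, Iterator, Tuple


def keep_french(
    records: Iterable[Dict[str, str]],
    lang_field: str = "language",  # hypothetical column name
    accepted: Tuple[str, ...] = ("french", "fr", "fra"),  # hypothetical labels
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose language label marks them as French."""
    for record in records:
        # Normalize the label so "French", " fr " and "FRA" all match.
        label = str(record.get(lang_field, "")).strip().lower()
        if label in accepted:
            yield record


# Toy records standing in for rows of a multilingual SFT dataset.
samples = [
    {"prompt": "Quelle est la capitale de la France ?", "language": "French"},
    {"prompt": "What is the capital of France?", "language": "English"},
    {"prompt": "Résous l'équation 2x + 3 = 7.", "language": "fr"},
    {"prompt": "Sample with no language label."},
]

french_only = list(keep_french(samples))
print(f"kept {len(french_only)} of {len(samples)} records")
```

Records without a language label are dropped rather than guessed at, which is the conservative choice when the goal is a French-only corpus. A real pipeline could apply the same predicate through a dataset library's filtering facility instead of a plain generator.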


Provider:

maas

Created:

2025-08-28
