luth-sft
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-30.
Download link: https://modelscope.cn/datasets/kurakurai/luth-sft

---
## Dataset Details
This dataset includes all the data used to fine-tune [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct), enhancing their French capabilities on tasks such as instruction following, mathematics, and general knowledge. The models also improved in English thanks to knowledge transfer between the two languages.
It contains ~338M tokens in French. Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).
## Dataset Sources
### Scholar
By **Kurakura AI**: [Dataset Link](https://huggingface.co/datasets/kurakurai/scholar).
Built from scraped exam papers of the French Baccalauréat and Preparatory Class (CPGE) entrance exams in mathematics, computer science, and physics.
### Tulu 3 Persona Instruct
By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following).
Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.
### Tulu 3 Persona Math
By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math).
Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.
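
The translate, regenerate, filter recipe described for the two Tulu 3 persona sets can be sketched as a small pipeline. This is only an illustration under stated assumptions: `Sample`, `rebuild_in_french`, and the three callables are hypothetical names, not taken from the Luth scripts, and the real translation, generation (Qwen3-32B), and filtering steps are not specified in this card, so they are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Sample:
    """One prompt/answer pair (hypothetical schema, for illustration only)."""
    prompt: str
    answer: str


def rebuild_in_french(
    prompts: Iterable[str],
    translate: Callable[[str], str],
    generate: Callable[[str], str],
    keep: Callable[[Sample], bool],
) -> Iterator[Sample]:
    """Translate each prompt, generate a fresh answer, and keep only accepted pairs.

    All three callables are assumptions standing in for the real steps:
    `translate` maps a source prompt to French, `generate` calls an answer
    model on the French prompt, and `keep` is the quality filter.
    """
    for prompt in prompts:
        fr_prompt = translate(prompt)
        sample = Sample(prompt=fr_prompt, answer=generate(fr_prompt))
        if keep(sample):
            yield sample
```

Passing the steps in as callables keeps the sketch independent of any particular translation service or inference stack, and lets each stage be swapped or tested on its own.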
### Smoltalk2
By **HuggingFaceTB**: [Dataset Link](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2).
Extracted French samples only.
### Aya Dataset
By **CohereLabs**: [Dataset Link](https://huggingface.co/datasets/CohereLabs/aya_dataset).
Extracted French samples only.
### OpenHermes-fr
By **legmlai**: [Dataset Link](https://huggingface.co/datasets/legmlai/openhermes-fr).
Filtered the dataset.
### CroissantLLM
By **Manuel Faysse**: [Dataset Link](https://huggingface.co/datasets/croissantllm/CroissantLLM-2201-sft).
Extracted French samples only.
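
For the sources above that were reduced to their French samples, the extraction step might look like the following minimal sketch. It assumes each record carries a language tag in a field named `language_code`; that field name, the accepted codes, and the helper names `is_french` and `french_only` are assumptions for illustration, and the actual column names differ between source datasets.

```python
from typing import Iterable, Iterator, Mapping

# Language tags treated as French in this sketch (ISO 639-1 and 639-2 forms).
FRENCH_CODES = frozenset({"fr", "fra", "fre"})


def is_french(record: Mapping[str, object], field: str = "language_code") -> bool:
    """Return True when the record's language tag denotes French."""
    code = str(record.get(field, "")).strip().lower()
    # Treat regional tags such as "fr-FR" or "fr_CA" as French too.
    primary = code.replace("_", "-").split("-")[0]
    return primary in FRENCH_CODES


def french_only(
    records: Iterable[Mapping[str, object]], field: str = "language_code"
) -> Iterator[Mapping[str, object]]:
    """Lazily yield only the records tagged as French."""
    return (record for record in records if is_french(record, field))
```

Records with a missing or unrecognized tag are dropped rather than guessed at, which errs on the side of a smaller but cleaner French subset.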
## Citation
```bibtex
@misc{luth2025kurakurai,
title = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
author = {Lasbordes, Maxence and Gad, Sinoué},
year = {2025},
howpublished = {\url{https://arxiv.org/abs/2510.05846}},
note = {arXiv:2510.05846}
}
```

Provider: maas
Created: 2025-08-28



