sammybow/luth-sft
收藏Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/sammybow/luth-sft
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: luth_scholar
num_bytes: 79535411
num_examples: 30282
- name: luth_croissantllm
num_bytes: 52420415
num_examples: 13872
- name: luth_smoltalk2
num_bytes: 112793189
num_examples: 50474
- name: luth_aya_dataset
num_bytes: 896669
num_examples: 1422
- name: luth_tulu3_persona_math
num_bytes: 235950332
num_examples: 47440
- name: luth_openhermes
num_bytes: 626036280
num_examples: 406516
- name: luth_tulu3_persona_instruct
num_bytes: 49432042
num_examples: 20617
download_size: 574642048
dataset_size: 1157064338
configs:
- config_name: default
data_files:
- split: luth_scholar
path: data/luth_scholar-*
- split: luth_croissantllm
path: data/luth_croissantllm-*
- split: luth_smoltalk2
path: data/luth_smoltalk2-*
- split: luth_aya_dataset
path: data/luth_aya_dataset-*
- split: luth_tulu3_persona_math
path: data/luth_tulu3_persona_math-*
- split: luth_openhermes
path: data/luth_openhermes-*
- split: luth_tulu3_persona_instruct
path: data/luth_tulu3_persona_instruct-*
license: odc-by
task_categories:
- text-generation
language:
- fr
size_categories:
- 100K<n<1M
tags:
- synthetic
- math
- instruction
---

---
## Dataset Details
This dataset includes all the data used to fine-tune [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct), enhancing their French capabilities on tasks such as instruction following, mathematics, and general knowledge. The models also improved in English thanks to knowledge transfer between the two languages.
It contains ~338M tokens in French. Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).
## Dataset Sources
### Scholar
By **Kurakura AI**: [Dataset Link](https://huggingface.co/datasets/kurakurai/scholar).
Built from scraped subjects of French Baccalauréat and Preparatory Class (CPGE) entrance exams in mathematics, computer science, and physics.
### Tulu 3 Persona Instruct
By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following).
Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.
### Tulu 3 Persona Math
By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math).
Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.
### Smoltalk2
By **HuggingFaceTB**: [Dataset Link](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2).
Extracted French samples only.
### Aya Dataset
By **CohereLabs**: [Dataset Link](https://huggingface.co/datasets/CohereLabs/aya_dataset).
Extracted French samples only.
### OpenHermes-fr
By **legmlai**: [Dataset Link](https://huggingface.co/datasets/legmlai/openhermes-fr).
Filtered the dataset.
### CroissantLLM
By **Manuel Faysse**: [Dataset Link](https://huggingface.co/datasets/croissantllm/CroissantLLM-2201-sft).
Extracted French samples only.
## Citation
```bibtex
@misc{luth2025kurakurai,
title = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
author = {Lasbordes, Maxence and Gad, Sinoué},
year = {2025},
howpublished = {\url{https://arxiv.org/abs/2510.05846}},
note = {arXiv:2510.05846}
}
```
---
dataset_info:
特征:
- 名称: messages
列表:
- 名称: content
数据类型: 字符串
- 名称: role
数据类型: 字符串
数据集划分:
- 名称: luth_scholar
字节数: 79535411
样本数量: 30282
- 名称: luth_croissantllm
字节数: 52420415
样本数量: 13872
- 名称: luth_smoltalk2
字节数: 112793189
样本数量: 50474
- 名称: luth_aya_dataset
字节数: 896669
样本数量: 1422
- 名称: luth_tulu3_persona_math
字节数: 235950332
样本数量: 47440
- 名称: luth_openhermes
字节数: 626036280
样本数量: 406516
- 名称: luth_tulu3_persona_instruct
字节数: 49432042
样本数量: 20617
下载大小: 574642048
数据集总大小: 1157064338
配置项:
- 配置名称: default
数据文件:
- 划分: luth_scholar
路径: data/luth_scholar-*
- 划分: luth_croissantllm
路径: data/luth_croissantllm-*
- 划分: luth_smoltalk2
路径: data/luth_smoltalk2-*
- 划分: luth_aya_dataset
路径: data/luth_aya_dataset-*
- 划分: luth_tulu3_persona_math
路径: data/luth_tulu3_persona_math-*
- 划分: luth_openhermes
路径: data/luth_openhermes-*
- 划分: luth_tulu3_persona_instruct
路径: data/luth_tulu3_persona_instruct-*
许可证: ODC-BY
任务类别:
- 文本生成
语言:
- 法语
规模类别:
- 100K<n<1M
标签:
- 合成数据
- 数学
- 指令跟随
---

---
## 数据集详情
本数据集包含用于微调**Luth-0.6B-Instruct**与**Luth-1.7B-Instruct**的全部训练数据,可提升这两个模型在指令跟随、数学推理与通用知识等任务上的法语处理能力。得益于跨语言知识迁移,模型的英语表现也得到了优化。
该数据集包含约3.38亿个法语Token,其数据预处理脚本已开源至[GitHub](https://github.com/kurakurai/Luth)。
## 数据集来源
### Scholar 子集
由**Kurakura AI**制作:[数据集链接](https://huggingface.co/datasets/kurakurai/scholar)。
该子集源自爬取的法国高中毕业会考(Baccalauréat)与预科班(CPGE)入学考试中数学、计算机科学与物理科目的真题。
### Tulu 3 Persona Instruct 子集
由**AllenAI**出品:[数据集链接](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following)。
将原提示词翻译为法语,使用**Qwen3-32B**生成全新回答后对数据集进行筛选。
### Tulu 3 Persona Math 子集
由**AllenAI**出品:[数据集链接](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math)。
将原提示词翻译为法语,使用**Qwen3-32B**生成全新回答后对数据集进行筛选。
### Smoltalk2 子集
由**HuggingFaceTB**提供:[数据集链接](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2)。
仅提取其中的法语样本。
### Aya Dataset 子集
由**CohereLabs**制作:[数据集链接](https://huggingface.co/datasets/CohereLabs/aya_dataset)。
仅提取其中的法语样本。
### OpenHermes-fr 子集
由**legmlai**整理:[数据集链接](https://huggingface.co/datasets/legmlai/openhermes-fr)。
对数据集进行了筛选处理。
### CroissantLLM 子集
由**Manuel Faysse**制作:[数据集链接](https://huggingface.co/datasets/croissantllm/CroissantLLM-2201-sft)。
仅提取其中的法语样本。
## 引用
bibtex
@misc{luth2025kurakurai,
title = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
author = {Lasbordes, Maxence and Gad, Sinoué},
year = {2025},
howpublished = {url{https://arxiv.org/abs/2510.05846}},
note = {arXiv:2510.05846}
}
提供机构:
sammybow



