sammybow/luth-sft

Hugging Face, updated 2026-04-21, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/sammybow/luth-sft
Description:
---
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: luth_scholar
    num_bytes: 79535411
    num_examples: 30282
  - name: luth_croissantllm
    num_bytes: 52420415
    num_examples: 13872
  - name: luth_smoltalk2
    num_bytes: 112793189
    num_examples: 50474
  - name: luth_aya_dataset
    num_bytes: 896669
    num_examples: 1422
  - name: luth_tulu3_persona_math
    num_bytes: 235950332
    num_examples: 47440
  - name: luth_openhermes
    num_bytes: 626036280
    num_examples: 406516
  - name: luth_tulu3_persona_instruct
    num_bytes: 49432042
    num_examples: 20617
  download_size: 574642048
  dataset_size: 1157064338
configs:
- config_name: default
  data_files:
  - split: luth_scholar
    path: data/luth_scholar-*
  - split: luth_croissantllm
    path: data/luth_croissantllm-*
  - split: luth_smoltalk2
    path: data/luth_smoltalk2-*
  - split: luth_aya_dataset
    path: data/luth_aya_dataset-*
  - split: luth_tulu3_persona_math
    path: data/luth_tulu3_persona_math-*
  - split: luth_openhermes
    path: data/luth_openhermes-*
  - split: luth_tulu3_persona_instruct
    path: data/luth_tulu3_persona_instruct-*
license: odc-by
task_categories:
- text-generation
language:
- fr
size_categories:
- 100K<n<1M
tags:
- synthetic
- math
- instruction
---

![Kurakura AI Logo](media/logo_kurakura.png)

---

## Dataset Details

This dataset includes all the data used to fine-tune [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct), enhancing their French capabilities on tasks such as instruction following, mathematics, and general knowledge. The models also improved in English thanks to knowledge transfer between the two languages.

It contains ~338M tokens in French. Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).

## Dataset Sources

### Scholar

By **Kurakura AI**: [Dataset Link](https://huggingface.co/datasets/kurakurai/scholar).
Built from scraped subjects of French Baccalauréat and Preparatory Class (CPGE) entrance exams in mathematics, computer science, and physics.

### Tulu 3 Persona Instruct

By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following).
Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.

### Tulu 3 Persona Math

By **AllenAI**: [Dataset Link](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math).
Translated prompts to French, generated new answers with **Qwen3-32B**, then filtered the dataset.

### Smoltalk2

By **HuggingFaceTB**: [Dataset Link](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2).
Extracted French samples only.

### Aya Dataset

By **CohereLabs**: [Dataset Link](https://huggingface.co/datasets/CohereLabs/aya_dataset).
Extracted French samples only.

### OpenHermes-fr

By **legmlai**: [Dataset Link](https://huggingface.co/datasets/legmlai/openhermes-fr).
Filtered the dataset.

### CroissantLLM

By **Manuel Faysse**: [Dataset Link](https://huggingface.co/datasets/croissantllm/CroissantLLM-2201-sft).
Extracted French samples only.

## Citation

```bibtex
@misc{luth2025kurakurai,
  title        = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
  author       = {Lasbordes, Maxence and Gad, Sinoué},
  year         = {2025},
  howpublished = {\url{https://arxiv.org/abs/2510.05846}},
  note         = {arXiv:2510.05846}
}
```
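
## Example: Inspecting the Splits

The card metadata lists seven named splits rather than a single `train` split, so it can help to check the per-split figures before picking one to load. The sketch below copies the `num_bytes` and `num_examples` values from the metadata, confirms they add up to the stated `dataset_size`, and computes each split's share of the examples. The helper names (`total_bytes`, `total_examples`, `example_share`, `load_split`) are hypothetical, and the repository id `sammybow/luth-sft` is assumed from this page's title; `load_split` relies on the third-party `datasets` library and network access, and is not called here.

```python
"""Sanity-check the split statistics listed in the dataset card metadata."""

# (num_bytes, num_examples) per split, copied from the card's `dataset_info`.
SPLITS = {
    "luth_scholar": (79_535_411, 30_282),
    "luth_croissantllm": (52_420_415, 13_872),
    "luth_smoltalk2": (112_793_189, 50_474),
    "luth_aya_dataset": (896_669, 1_422),
    "luth_tulu3_persona_math": (235_950_332, 47_440),
    "luth_openhermes": (626_036_280, 406_516),
    "luth_tulu3_persona_instruct": (49_432_042, 20_617),
}
DATASET_SIZE = 1_157_064_338  # `dataset_size` from the card


def total_bytes(splits):
    """Sum the uncompressed byte sizes of all splits."""
    return sum(num_bytes for num_bytes, _ in splits.values())


def total_examples(splits):
    """Sum the example counts of all splits."""
    return sum(num_examples for _, num_examples in splits.values())


def example_share(splits):
    """Return each split's fraction of the total number of examples."""
    total = total_examples(splits)
    return {name: n / total for name, (_, n) in splits.items()}


def load_split(name, repo_id="sammybow/luth-sft"):
    """Load one split by name (hypothetical helper; repo id is an assumption).

    Requires the `datasets` package and network access.
    """
    if name not in SPLITS:
        raise ValueError(f"unknown split: {name!r}")
    from datasets import load_dataset  # lazy import of a third-party dependency
    return load_dataset(repo_id, split=name)


if __name__ == "__main__":
    # The per-split sizes should add up to the stated dataset size.
    assert total_bytes(SPLITS) == DATASET_SIZE
    shares = example_share(SPLITS)
    for name in sorted(shares, key=shares.get, reverse=True):
        print(f"{name:30s} {shares[name]:6.1%}")
```

By these figures, `luth_openhermes` accounts for roughly 71% of the examples, so anyone sampling across splits may want to weight them rather than concatenate them as-is.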

Provider:
sammybow