scholar
Source: ModelScope community (updated 2025-11-27, indexed 2025-09-27)
Download link: https://modelscope.cn/datasets/kurakurai/scholar

---
## Dataset Details
This dataset was created to address the lack of high-quality scientific datasets in French. It is based on Baccalauréat and Classes Préparatoires (CPGE) exam questions and their detailed solutions, covering a wide range of subjects, primarily mathematics, physics, chemistry, and computer science.
The dataset includes 30.3K annotated samples designed to support both educational and research applications in French-language NLP.
It was also used to train [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct).
Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).
## Dataset Subject Distribution
The dataset covers a diverse set of subjects, as illustrated by the distribution below:
[Figure: pie chart of the subject distribution]

## Data Collection & Processing Pipeline
The data was primarily sourced from:
- [Prépas.org – Sujet CPGE Archive](https://prepas.org/index.php?module=Sujets)
- [Sujetdebac.fr – French Baccalauréat Exams](https://www.sujetdebac.fr/)
A total of approximately 14,000 PDFs were collected from these sources.
## Dataset Construction Steps
1. **Filtering and Pairing**
- Remove outdated PDFs (before 1990) and low-quality files where content could not be reliably extracted.
- Match each exam question PDF with its corresponding correction PDF, and eliminate duplicates.
2. **Extraction**
For each (question, correction) pair, extract:
- A structured list of questions
- The corresponding list of answers
3. **Contextualization**
Associate each question with its full context, including:
- The subject instructions
- Any relevant preceding questions and answers
4. **Refinement using Gemini 2.5**
Use Gemini 2.5 to:
- Correct LaTeX formatting issues
- Fix structural errors
- Reformat answers and standardize their quality
5. **Sanity Checks and Cleaning**
Remove samples with:
- Missing data
- Formatting errors
- Misalignments between questions and answers
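The non-LLM steps above (1, 3 and 5) can be illustrated with a minimal Python sketch. It is not taken from the project's scripts: the `ExamFile` record, its field names, the function names, and the specific checks (an odd number of `$` delimiters as a formatting error, unequal list lengths as a misalignment) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExamFile:
    """Hypothetical record for one collected PDF (field names are assumptions)."""
    exam_id: str  # identifier shared by a subject and its correction
    year: int
    kind: str     # "subject" or "correction"
    path: str


def pair_exams(files, min_year=1990):
    """Step 1: drop files dated before min_year, pair subjects with corrections,
    and keep only the first copy of each (exam_id, kind)."""
    subjects, corrections = {}, {}
    for f in files:
        if f.year < min_year:
            continue
        target = subjects if f.kind == "subject" else corrections
        target.setdefault(f.exam_id, f)  # later duplicates are ignored
    return [(subjects[k], corrections[k]) for k in sorted(subjects) if k in corrections]


def contextualize(instructions, questions, answers):
    """Step 3: attach the subject instructions and all preceding Q/A pairs
    to each question. Misaligned exams (unequal lengths) yield no samples."""
    if len(questions) != len(answers):
        return []
    samples, history = [], []
    for q, a in zip(questions, answers):
        context = "\n\n".join([instructions, *history])
        samples.append({"context": context, "question": q, "answer": a})
        history.append(f"Q: {q}\nA: {a}")
    return samples


def is_clean(sample):
    """Step 5: reject samples with missing fields or an odd number of `$`
    delimiters (a crude proxy for broken LaTeX; escaped \\$ is not handled)."""
    fields = [(sample.get(k) or "") for k in ("question", "answer")]
    if not all(text.strip() for text in fields):
        return False
    return all(text.count("$") % 2 == 0 for text in fields)
```

In this sketch every earlier question and answer is added to the context, whereas the pipeline described above keeps only the relevant ones; selecting those would need a heuristic or a model that the card does not specify.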
## Citation
```bibtex
@misc{luth2025kurakurai,
title = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
author = {Lasbordes, Maxence and Gad, Sinoué},
year = {2025},
howpublished = {\url{https://arxiv.org/abs/2510.05846}},
note = {arXiv:2510.05846}
}
```

---
Provider: maas
Created: 2025-08-28



