
scholar

ModelScope community. Updated 2025-11-27, indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/kurakurai/scholar
Resource description:
![Kurakura AI Logo](media/logo_kurakura.png)

---

## Dataset Details

This dataset was created to address the lack of high-quality scientific datasets in French. It is based on Baccalauréat and Classes Préparatoires (CPGE) exam questions and their detailed solutions, covering a wide range of subjects, primarily mathematics, physics and chemistry, and computer science.

The dataset includes 30.3K annotated samples designed to support both educational and research applications in French-language NLP. It was also used to train [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct).

Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).

## Dataset Subject Distribution

The dataset covers a diverse set of subjects, as illustrated by the distribution below:

![Scholar_Pie_Chart](media/pie_chart.png)

## Data Collection & Processing Pipeline

The data was primarily sourced from:

- [Prépas.org – Sujet CPGE Archive](https://prepas.org/index.php?module=Sujets)
- [Sujetdebac.fr – French Baccalauréat Exams](https://www.sujetdebac.fr/)

A total of approximately 14,000 PDFs were collected from these sources.

## Dataset Construction Steps

1. **Filtering and Pairing**
   - Remove outdated PDFs (before 1990) and low-quality files where content could not be reliably extracted.
   - Match each exam question PDF with its corresponding correction PDF, and eliminate duplicates.
2. **Extraction**
   For each (question, correction) pair, extract:
   - A structured list of questions
   - The corresponding list of answers
3. **Contextualization**
   Associate each question with its full context, including:
   - The subject instructions
   - Any relevant preceding questions and answers
4. **Refinement using Gemini 2.5**
   Use Gemini 2.5 to:
   - Correct LaTeX formatting issues
   - Fix structural errors
   - Reformat answers and standardize their quality
5. **Sanity Checks and Cleaning**
   Remove samples with:
   - Missing data
   - Formatting errors
   - Misalignments between questions and answers

## Citation

```bibtex
@misc{luth2025kurakurai,
  title        = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
  author       = {Lasbordes, Maxence and Gad, Sinoué},
  year         = {2025},
  howpublished = {\url{https://arxiv.org/abs/2510.05846}},
  note         = {arXiv:2510.05846}
}
```
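
## Illustrative Sketch of Steps 1 and 5

The construction steps above describe the pairing and cleaning stages only in prose. The following is a minimal sketch of what steps 1 and 5 could look like, not the authors' actual scripts (those are in the GitHub repository linked above). The filename convention `<exam>_<year>_<sujet|corrige>.pdf`, the sample keys `questions` and `answers`, and the function names `pair_pdfs` and `is_clean` are all assumptions made for illustration.

```python
import re

# Hypothetical filename convention: "<exam>_<year>_<sujet|corrige>.pdf".
# The real archives may name their files differently.
NAME_RE = re.compile(r"^(?P<exam>.+)_(?P<year>\d{4})_(?P<kind>sujet|corrige)\.pdf$")

MIN_YEAR = 1990  # the dataset card states that PDFs from before 1990 are removed


def pair_pdfs(filenames):
    """Return sorted (question_pdf, correction_pdf) pairs.

    Files that are too old, do not match the assumed naming scheme,
    or have no counterpart are skipped.
    """
    questions, corrections = {}, {}
    for name in set(filenames):  # set() drops exact duplicate entries
        match = NAME_RE.match(name)
        if match is None or int(match["year"]) < MIN_YEAR:
            continue
        key = (match["exam"], int(match["year"]))
        target = questions if match["kind"] == "sujet" else corrections
        target[key] = name
    return [(questions[k], corrections[k]) for k in sorted(questions) if k in corrections]


def is_clean(sample):
    """Sanity check: non-empty, aligned lists of non-blank strings."""
    qs, ans = sample.get("questions"), sample.get("answers")
    if not qs or not ans:
        return False  # missing data
    if len(qs) != len(ans):
        return False  # misalignment between questions and answers
    return all(isinstance(text, str) and text.strip() for text in qs + ans)


if __name__ == "__main__":
    files = [
        "x_ens_2001_sujet.pdf",
        "x_ens_2001_corrige.pdf",
        "bac_s_1985_sujet.pdf",    # dropped: before 1990
        "bac_s_1985_corrige.pdf",  # dropped: before 1990
        "mines_2010_sujet.pdf",    # dropped: no correction available
    ]
    print(pair_pdfs(files))
    print(is_clean({"questions": ["Q1"], "answers": ["A1"]}))
```

Keying the pairs on `(exam, year)` keeps the matching logic independent of directory layout. A real pipeline would likely also need fuzzy matching for inconsistent filenames and a content-level check for the "low-quality file" criterion, neither of which is attempted here.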

Provider:
maas
Created:
2025-08-28