scholar

Name: scholar
Creator: maas
Published: 2025-11-27 16:47:10
License: No description available

Source: ModelScope community. Updated 2025-11-27. Indexed 2025-09-27.

Download link:

https://modelscope.cn/datasets/kurakurai/scholar


Resource description:

![Kurakura AI Logo](media/logo_kurakura.png)

---

## Dataset Details

This dataset was created to address the lack of high-quality scientific datasets in French. It is based on Baccalauréat and Classes Préparatoires (CPGE) exam questions and their detailed solutions, covering a wide range of subjects, primarily mathematics, physics and chemistry, and computer science.

The dataset includes 30.3K annotated samples designed to support both educational and research applications in French-language NLP. It was also used to train [**Luth-0.6B-Instruct**](https://huggingface.co/kurakurai/Luth-0.6B-Instruct) and [**Luth-1.7B-Instruct**](https://huggingface.co/kurakurai/Luth-1.7B-Instruct).

Our data scripts are available on [GitHub](https://github.com/kurakurai/Luth).

## Dataset Subject Distribution

The dataset covers a diverse set of subjects, as illustrated by the distribution below:

![Scholar_Pie_Chart](media/pie_chart.png)

## Data Collection & Processing Pipeline

The data was primarily sourced from:

- [Prépas.org – Sujet CPGE Archive](https://prepas.org/index.php?module=Sujets)
- [Sujetdebac.fr – French Baccalauréat Exams](https://www.sujetdebac.fr/)

A total of approximately 14,000 PDFs were collected from these sources.

## Dataset Construction Steps

1. **Filtering and Pairing**
   - Remove outdated PDFs (before 1990) and low-quality files where content could not be reliably extracted.
   - Match each exam question PDF with its corresponding correction PDF, and eliminate duplicates.
2. **Extraction**
   For each (question, correction) pair, extract:
   - A structured list of questions
   - The corresponding list of answers
3. **Contextualization**
   Associate each question with its full context, including:
   - The subject instructions
   - Any relevant preceding questions and answers
4. **Refinement using Gemini 2.5**
   Use Gemini 2.5 to:
   - Correct LaTeX formatting issues
   - Fix structural errors
   - Reformat and standardize answer quality
5. **Sanity Checks and Cleaning**
   Remove samples with:
   - Missing data
   - Formatting errors
   - Misalignments between questions and answers

## Citation

```bibtex
@misc{luth2025kurakurai,
  title        = {Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer},
  author       = {Lasbordes, Maxence and Gad, Sinoué},
  year         = {2025},
  howpublished = {\url{https://arxiv.org/abs/2510.05846}},
  note         = {arXiv:2510.05846}
}
```
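As an illustration of the construction steps listed above, the following is a minimal Python sketch of filtering and pairing (step 1), contextualization (step 3), and a basic sanity check (step 5). It is not the authors' implementation, which lives in the GitHub repository linked above and may differ substantially. The `ExamPdf` record, its fields (`exam_id`, `year`, `kind`, `text`), the sample keys (`context`, `question`, `answer`), and all function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExamPdf:
    """Hypothetical metadata record for one collected PDF."""
    exam_id: str         # assumed key shared by a subject and its correction
    year: int
    kind: str            # "question" or "correction"
    text: Optional[str]  # extracted text, or None if extraction failed


def filter_and_pair(pdfs, min_year=1990):
    """Step 1: drop pre-1990 or unreadable PDFs, then pair questions with corrections."""
    questions, corrections = {}, {}
    for pdf in pdfs:
        if pdf.year < min_year or not pdf.text:
            continue
        bucket = questions if pdf.kind == "question" else corrections
        # setdefault keeps the first PDF seen per exam_id, which drops duplicates.
        bucket.setdefault(pdf.exam_id, pdf)
    # Keep only exams that have both a question PDF and a correction PDF.
    return [(questions[k], corrections[k]) for k in questions if k in corrections]


def contextualize(instructions, questions, answers):
    """Step 3: attach the subject instructions and preceding Q/A to each question."""
    if len(questions) != len(answers):
        # Treat a length mismatch as a question/answer misalignment (step 5).
        return []
    samples, history = [], []
    for q, a in zip(questions, answers):
        context = "\n\n".join([instructions, *history])
        samples.append({"context": context, "question": q, "answer": a})
        history.append(f"Q: {q}\nA: {a}")
    return samples


def is_clean(sample):
    """Step 5: reject samples with a missing or empty field."""
    return all((sample.get(k) or "").strip() for k in ("context", "question", "answer"))
```

In this sketch each question sees the subject instructions plus every earlier question and answer from the same exam. The dataset card says only "relevant" preceding items are included, so a real pipeline would likely select a subset instead of the full history.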


Provider:

maas

Created:

2025-08-28

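Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset aims to fill the gap in high-quality French-language scientific datasets. It is built from French Baccalauréat and Classes Préparatoires (CPGE) exam questions and their detailed solutions, covers subjects including mathematics, physics, chemistry, and computer science, and contains 30.3K annotated samples suited to educational and research use in French NLP. The data was refined from approximately 14,000 collected PDF files through filtering, extraction, contextualization, and refinement with Gemini 2.5, and was used to train the Luth models.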

The content above was collected and summarized by 遇见数据集.