Gargantua-R1-Compact

ModelScope community · Updated 2026-01-06 · Indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/prithivMLmods/Gargantua-R1-Compact
Resource overview:
![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/dN40mA7_MHh4zcvt35QJK.png)

<div style=" background: rgba(255, 61, 61, 0.15); padding: 16px; border-radius: 6px; border: 1px solid rgba(255, 0, 0, 0.3); margin: 16px 0; ">
<details>
<summary>Gargantua-R1 Distribution</summary>

![Gargantua-R1-Compact Dataset Distribution - visual selection.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/MMnP1qtx9gqxGyIFOXcQe.png)

</details>
</div>

# **Gargantua-R1-Compact (experimental purpose)**

> Gargantua-R1-Compact is a large-scale, high-quality reasoning dataset primarily designed for mathematical reasoning and STEM education. It contains approximately 6.67 million problems and solution traces, with a strong emphasis on mathematics (over 70%), as well as coverage of scientific domains, algorithmic challenges, and creative logic puzzles. The dataset is suitable for training and evaluating models in mathematical problem solving, competitive programming, scientific computing, and reasoning competence.
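With roughly 6.67 million rows in the full dataset, it can be useful to look at a few records before committing to a full download. The following is a minimal sketch, assuming the `datasets` package is installed and that this repository can be read in streaming mode (the card does not confirm this). `stream_preview` is a hypothetical helper name, not part of the dataset or the library.

```python
from itertools import islice


def stream_preview(repo_id="prithivMLmods/Gargantua-R1-Compact", n=3):
    """Return the first `n` rows as a list of dicts without a full download.

    Hypothetical helper for illustration; assumes the repository supports
    streaming and exposes a "train" split.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields an IterableDataset that fetches rows on demand.
    stream = load_dataset(repo_id, split="train", streaming=True)

    # islice stops after n rows, so only a small amount of data is read.
    return list(islice(stream, n))
```

Calling `stream_preview(n=3)` needs network access and returns a list of row dictionaries. Inspecting `row.keys()` on the result is a safe way to check the actual column names before relying on any particular schema.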
---

## Quick Start with Hugging Face Datasets🤗

```bash
pip install -U datasets
```

```py
from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Compact", split="train")
```

---

## Dataset Composition

The dataset spans multiple domains, with a heavy emphasis on mathematical reasoning:

| Category | Percentage | Description |
|----------|------------|-------------|
| **Math reasoning** | 73.93% | Core mathematical problems, proofs, and computational challenges |
| **Diverse scientific domains** | 12.11% | Physics, chemistry, biology, and interdisciplinary scientific problems |
| **Competitive coding** | 11.35% | Programming challenges, algorithms, and data structure problems |
| **Academic science** | 1.37% | Research-level scientific questions and methodology |
| **Creative & analytic reasoning** | 0.95% | Logic puzzles, analytical thinking, and creative problem-solving |
| **MLOps/LLMs/diffusion/CUDA** | 0.25% | Machine learning operations and specialized technical content |
| **Graphs/charts data -> JSON** | 0.06% | Data visualization and structured data interpretation |

---

## Dataset Statistics

- **Estimated Total Rows:** 6.67M
- **Estimated Full Size:** 144GB
- **Dataset Preview:** 232,530 rows
- **Preview Size:** 2.23GB
- **License:** Apache 2.0

## Data Sources

The dataset is constructed from multiple high-quality sources:

- **Reasoning Traces:** Derived from `prithivMLmods/Poseidon-Reasoning-5M`
- **Mathematical Reasoning:** Homogeneous traces from `nvidia/open-math-reasoning`
- **Custom Problems:** Additional problems, mostly custom and modular, contributed by [prithivMLmods](https://huggingface.co/prithivMLmods)

---

## Key Features

### Mathematical Excellence

- **Comprehensive Coverage:** From basic arithmetic to advanced mathematical concepts
- **Step-by-Step Reasoning:** Detailed solution traces for complex mathematical problems
- **Multiple Approaches:** Various solution methodologies for the same problems

### High-Quality Reasoning Traces

- **Structured Solutions:** Problems paired with detailed reasoning steps
- **Verification:** Solutions include validation and checking mechanisms
- **Educational Value:** Explanations suitable for learning and understanding

### Diverse Problem Types

- **Pure Mathematics:** Algebra, calculus, geometry, number theory, and discrete mathematics
- **Applied Mathematics:** Statistics, probability, optimization, and mathematical modeling
- **Scientific Applications:** Physics problems, chemistry calculations, and biological modeling
- **Coding Challenges:** Algorithm implementation, data structures, and computational thinking

---

## Use Cases

### Primary Applications

- **Mathematical Reasoning Model Training:** Fine-tuning LLMs for mathematical problem-solving
- **STEM Education:** Supporting educational AI systems and tutoring applications
- **Research & Development:** Advancing reasoning capabilities in AI systems
- **Competitive Programming:** Training models for coding competitions and algorithmic challenges

### Secondary Applications

- **Scientific Computing:** Supporting scientific research and computational science
- **Academic Assessment:** Automated grading and problem generation systems
- **Reasoning Evaluation:** Benchmarking and testing reasoning capabilities

---

## Dataset Structure

The dataset follows a structured format with the following key components:

```json
{
  "problem": "Problem statement or question",
  "solution": "Detailed step-by-step solution with reasoning traces",
  "category": "Domain classification (math_reasoning, coding, etc.)",
  "difficulty": "Problem complexity level",
  "source": "Origin of the problem"
}
```

## Response format:

```text
<think>
-- reasoning trace --
</think>
-- answer --
```

---

## Quality Assurance

### Data Curation

- **Expert Review:** Problems vetted by domain experts
- **Solution Verification:** All solutions mathematically verified
- **Format Consistency:** Standardized structure across all entries
- **Duplicate Removal:** Comprehensive deduplication process

### Reasoning Quality

- **Pedagogical Value:** Solutions include educational explanations
- **Multiple Verification:** Cross-checking of mathematical solutions
- **Error Correction:** Systematic error identification and correction
- **Clarity Standards:** Clear, understandable reasoning traces

## Getting Started

### Installation and Usage

```python
from datasets import load_dataset

# Load all available splits (the card estimates the full dataset at 144GB)
dataset = load_dataset("prithivMLmods/Gargantua-R1-Compact")

# Access the train split
train_data = dataset['train']
print(f"Dataset size: {len(train_data)}")

# Print the first five examples.
# Dataset.take needs a recent `datasets` release;
# train_data.select(range(5)) is an alternative on older versions.
for example in train_data.take(5):
    print("Problem:", example['problem'])
    print("Solution:", example['solution'])
    print("---")
```

---

## Recommended Use

- **Fine-tuning:** Use for mathematical reasoning model fine-tuning
- **Evaluation:** Benchmark mathematical reasoning capabilities
- **Research:** Academic research on reasoning and problem-solving
- **Education:** Support educational AI applications

## Performance Benchmarks

Models trained on Gargantua-R1-Compact show significant improvements in:

- Mathematical problem-solving accuracy
- Step-by-step reasoning quality
- Multi-domain reasoning transfer
- Competitive programming performance

---

## Contributing

We welcome contributions to improve the dataset quality:

- **Problem Submission:** Submit high-quality mathematical problems
- **Solution Review:** Help verify and improve existing solutions
- **Error Reporting:** Report any errors or inconsistencies
- **Domain Expansion:** Contribute problems in underrepresented areas

## Citation

<div style=" background: rgba(255, 61, 61, 0.15); padding: 16px; border-radius: 6px; border: 1px solid rgba(255, 0, 0, 0.3); margin: 16px 0; ">
<details>
<summary>Gargantua-R1-Compact</summary>

```bibtex
@misc{prithiv_sakthi_2025,
  author    = { Prithiv Sakthi },
  title     = { Gargantua-R1-Compact (Revision 522d6fb) },
  year      = 2025,
  url       = { https://huggingface.co/datasets/prithivMLmods/Gargantua-R1-Compact },
  doi       = { 10.57967/hf/6176 },
  publisher = { Hugging Face }
}
```

</details>
</div>

---

## Limitations and Considerations

### Known Limitations

- **Language:** Primarily English-language content
- **Domain Bias:** Heavy emphasis on mathematical reasoning
- **Cultural Context:** Problems may reflect specific educational contexts
- **Complexity Range:** Varying difficulty levels within categories

### Ethical Considerations

- **Educational Integrity:** Should not be used to complete academic assignments
- **Bias Awareness:** Users should be aware of potential biases in problem selection
- **Responsible Use:** Follow academic and research ethics guidelines

---

## License

This dataset is released under the Apache 2.0 License, allowing for both commercial and non-commercial use with proper attribution.

## Version History

- **v1.0:** Initial release with 6.67M reasoning traces
- **Preview:** 232K sample for evaluation and testing

| Maintained by | Last Updated |
|---------------|--------------|
| **[prithivMLmods](https://huggingface.co/prithivMLmods)** | **Aug 2025** |
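The response format described above wraps the reasoning trace in `<think>` tags and places the answer after the closing tag. Below is a minimal sketch of how such a string could be split into its two parts. It assumes each response contains at most one `<think>` block and that the text comes from a field such as `solution`, which the card does not state explicitly. `split_response` is a hypothetical helper name.

```python
import re

# Matches one <think>...</think> block; DOTALL lets the trace span multiple lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text):
    """Split a response into (reasoning_trace, answer).

    Returns (None, stripped_text) when no <think> block is present.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is treated as the final answer.
    answer = text[match.end():].strip()
    return reasoning, answer


# Made-up example string, not taken from the dataset.
example = "<think>\n2 + 2 = 4, then 4 * 3 = 12.\n</think>\nThe answer is 12."
reasoning, answer = split_response(example)
print(reasoning)  # 2 + 2 = 4, then 4 * 3 = 12.
print(answer)     # The answer is 12.
```

Separating the two parts this way makes it possible to train or evaluate on the final answer alone, or to measure trace length, without changing the stored records.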

Provider: maas
Created: 2025-08-10