five

CodeX-2M-Thinking

收藏
魔搭社区2026-05-15 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/XenArcAI/CodeX-2M-Thinking
下载链接
链接失效反馈
官方服务:
资源简介:
# Modotte --- <p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/ZP4YDDIRewH5M-jKmE4Rt.png" alt="CodeX Banner" width="70%" style="border-radius:15px;" /> > Note: This dataset is part of the lineup CodeX by Modotte. You can get lots of datasets in this same lineup, with the main focus on providing very high-quality datasets for model training and fine-tuning. This dataset is fully synthetic, curated from high-quality public sources and enhanced with synthetic data generated using both closed and open-source models. It serves as a strong foundation for instruction-based model tuning and fine-tuning, offering one of the most refined and extensive corpora available for coding tasks with reasoning. ### Key Features - **Scale**: 2 million examples of highly curated coding data - **Diversity**: Comprehensive coverage of programming domains from basic syntax to advanced software engineering - **Quality**: Multi-stage filtering and verification processes, including ranking-based filtering and expert selections - **Thinking Focus**: Step-by-step reasoning included in responses, optimized for instruction training with detailed thought processes - **Accuracy**: Verified code executions and correctness validation using automated testing frameworks ## Dataset Overview **CodeX-2M-Thinking** is a meticulously curated coding dataset designed specifically for instruction-based model tuning and fine-tuning of existing models with enhanced code generation and reasoning capabilities. This fully synthetic dataset represents a large and comprehensively filtered corpus of coding data on the Hugging Face platform, emphasizing a thinking approach with step-by-step reasoning for deeper model training. ## How to Use? ```bash pip install -U datasets fsspec ``` ```python from datasets import load_dataset dataset = load_dataset("Modotte/CodeX-2M-Thinking") ``` ### Key Features - **Scale**: 2 million examples of highly curated coding data - **Diversity**: Comprehensive coverage of programming domains from basic syntax to advanced software engineering - **Quality**: Multi-stage filtering and verification processes, including ranking-based filtering and expert selections - **Thinking Focus**: Step-by-step reasoning included in responses, optimized for instruction training with detailed thought processes - **Accuracy**: Verified code executions and correctness validation using automated testing frameworks ## Data Curation Process This dataset has been carefully constructed through a fully synthetic approach, selectively generating and merging examples to enrich the overall dataset for generation models. ### Data Sources - **High-Quality Existing Datasets**: Curated from multiple premium coding datasets available online (e.g., from NVIDIA and Modotte's internal collections) - **Synthetic Generation**: Fully generated using both closed-source and open-source language models (Modotte) - **Expert Validation**: Human-verified code solutions, reasoning, and implementations (Modotte) ### Filtering Pipeline Our rigorous filtering process includes open and closed-source filtering techniques, ensuring only the highest-quality examples are retained: 1. **Deduplication**: Removal of duplicate problems and code solutions 2. **Normalization**: Code formatting standardization and syntax cleanup 3. **Stopword Processing**: Intelligent removal of non-essential comments or boilerplate 4. **Quality Scoring**: Multi-dimensional quality assessment using metrics like code complexity, readability, and efficiency 5. **Ranking-Based Filtering**: Advanced ranking algorithms to prioritize top-tier examples based on relevance, novelty, and utility 6. **Expert Selections**: Manual curation by coding experts to select exemplary samples 7. **Answer Verification**: Automated testing and execution validation using frameworks like pytest or unit tests 8. **Content Filtering**: Removal of inappropriate, outdated, or incorrect code 9. **Diversity Balancing**: Ensuring balanced representation across languages and domains through algorithmic sampling ### Problem Complexity Distribution - **Basic Level** (30%): Fundamental programming concepts, simple syntax, and basic operations - **Intermediate Level** (30%): Multi-function problems requiring modular code and basic algorithms - **Advanced Level** (40%): Complex challenges involving data structures, optimization, and system design ### Programming Domains Covered - Algorithms and Data Structures - Web Development and Frameworks - Machine Learning and AI Implementations - System Programming and Operating Systems - Database Management and SQL/NoSQL - Software Engineering Best Practices - Competitive Programming Problems > Note: Domains are for reference only. The actual data is very diverse and covers more domains than stated. The actual data includes more complex and high-level questions than stated, spanning multiple programming languages such as Python, Java, C++, JavaScript, and others. ## Use Cases - **Fine-tuning** code generation and reasoning capabilities in language models - **Training** instruction-following models with a coding and reasoning focus - **Benchmarking** model performance on coding tasks, problem-solving, and logical reasoning - **Research** in AI-assisted programming, automated code completion, and explainable AI - **Educational** applications requiring step-by-step code explanations and reasoning ## Dataset Format Each example contains: - **Problem Statement**: Clear coding challenge or task description - **Step-by-Step Solution**: Detailed reasoning process - **Code Solution**: Final executable code with integrated reasoning ## Quality Assurance - **Automated Verification**: All code solutions verified using execution environments and testing suites - **Correctness Guarantee**: Only problems with verified correct and functional code are included - **Human Review**: Sample validation by coding experts - **Automated Checks**: Static analysis, linting, and runtime verification where applicable - **Open and Closed-Source Filtering**: Integration of proprietary and community-driven tools for enhanced quality control ## Performance Metrics Models trained on this dataset show significant improvements in: - Code generation accuracy with reasoning - Efficiency in producing detailed, step-by-step solutions - Problem-solving speed and logical coherence - Cross-language and cross-domain code transfer - Reduction in hallucinated or erroneous code outputs through better reasoning ## Acknowledgments Special thanks to our partners and contributors: - **NVIDIA** - Reference datasets; CodeX contains many examples taken from NVIDIA's existing datasets - **Modotte Team** - Dataset curation, quality assurance, along with customly generated examples ## Citation **Anyone** can freely use and modify this dataset. ## License This dataset is released under [apache-2.0]. ```bibtex @dataset{codex2024, title={CodeX-2M-Thinking: Large-Scale Coding Dataset with Reasoning}, author={Parvesh Rawal at Modotte}, year={2024}, publisher={Modotte}, url={https://huggingface.co/datasets/Modotte/CodeX-2M-Thinking} } ``` ## Contact For questions, suggestions, or collaboration opportunities: - **Email**: [Modotte](team@modotte.com) - **Twitter**: [@Modotte] - **GitHub**: [Modotte] --- *Built with ❤️ by Modotte - Advancing AI through high-quality data*

# XenArcAI --- <p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/ZP4YDDIRewH5M-jKmE4Rt.png" alt="CodeX 横幅" width="70%" style="border-radius:15px;" /> </p> > 注:本数据集是XenArcAI推出的CodeX系列的组成部分。您可在该系列中获取多款数据集,其核心目标是为模型训练与微调提供极高质量的数据集。 本数据集完全由合成数据构建,从优质公开数据源精选而来,并结合闭源与开源大语言模型(Large Language Model)生成的合成数据进行增强。它是面向指令式模型微调的坚实基础,提供了当前可用的、针对带推理的编码任务的最精细且最全面的语料库之一。 ### 核心特性 - **规模**:200万条经过精心筛选的高质量编码数据样本 - **多样性**:全面覆盖从基础语法到高级软件工程的各类编程领域 - **质量**:采用多阶段过滤与验证流程,包括基于排序的筛选与专家遴选 - **思维聚焦**:响应中包含逐步推理过程,针对带有详细思考流程的指令训练进行了优化 - **准确性**:通过自动化测试框架验证代码执行与正确性 ## 数据集概览 **CodeX-2M-Thinking** 是一款经过精心整理的编码数据集,专为提升模型代码生成与推理能力的指令式模型微调而设计。这款完全合成的数据集是Hugging Face(拥抱脸)平台上规模庞大且经过全面过滤的编码语料库,强调通过逐步推理的思维方式实现更深入的模型训练。 ## 使用方法 bash pip install -U datasets fsspec python from datasets import load_dataset dataset = load_dataset("XenArcAI/CodeX-2M-Thinking") ### 核心特性 - **规模**:200万条经过精心筛选的高质量编码数据样本 - **多样性**:全面覆盖从基础语法到高级软件工程的各类编程领域 - **质量**:采用多阶段过滤与验证流程,包括基于排序的筛选与专家遴选 - **思维聚焦**:响应中包含逐步推理过程,针对带有详细思考流程的指令训练进行了优化 - **准确性**:通过自动化测试框架验证代码执行与正确性 ## 数据整理流程 本数据集通过完全合成的方式精心构建,通过选择性生成与合并样本以丰富面向生成模型的整体数据集。 ### 数据来源 - **优质现有数据集**:从线上多个优质编码数据集精选而来(例如来自NVIDIA与XenArcAI内部馆藏的数据集) - **合成生成**:通过闭源与开源大语言模型(Large Language Model)完全生成 - **专家验证**:由XenArcAI团队对代码解决方案、推理过程与实现进行人工审核 ### 过滤流程 我们的严格过滤流程结合了开源与闭源过滤技术,仅保留最高质量的样本: 1. **去重**:移除重复的问题与代码解决方案 2. **标准化**:统一代码格式并清理语法问题 3. **停用词处理**:智能移除非必要注释或样板代码 4. **质量评分**:通过代码复杂度、可读性与效率等指标进行多维度质量评估 5. **基于排序的筛选**:采用高级排序算法,根据相关性、新颖性与实用性优先筛选顶级样本 6. **专家遴选**:由编程专家进行手动整理,挑选优质示例 7. **答案验证**:使用pytest等测试框架进行自动化测试与执行验证 8. **内容过滤**:移除不当、过时或错误的代码 9. **多样性平衡**:通过算法采样确保各编程语言与领域的数据分布均衡 ### 问题复杂度分布 - **基础级(30%)**:基础编程概念、简单语法与基础操作 - **进阶级(30%)**:需要模块化代码与基础算法的多功能问题 - **高级(40%)**:涉及数据结构、优化与系统设计的复杂挑战 ### 覆盖的编程领域 - 算法与数据结构 - Web开发与框架 - 机器学习与AI实现 - 系统编程与操作系统 - 数据库管理与SQL/NoSQL - 软件工程最佳实践 - 竞赛编程问题 > 注:上述领域仅作参考,实际数据多样性极强,覆盖的领域远超上述范围。实际数据包含比描述更复杂的高阶问题,支持Python、Java、C++、JavaScript等多种编程语言。 ## 应用场景 - **微调**:语言模型的代码生成与推理能力 - **训练**:以编码与推理为核心的指令跟随模型 - **基准测试**:模型在编码任务、问题解决与逻辑推理方面的性能 - **研究**:AI辅助编程、自动代码补全与可解释AI领域的研究 - **教育**:需要逐步代码解释与推理过程的教学应用 ## 数据集格式 每个样本包含: - **问题描述**:清晰的编码挑战或任务说明 - **逐步解决方案**:详细的推理过程 - **代码解决方案**:集成了推理过程的最终可执行代码 ## 质量保障 - **自动化验证**:所有代码解决方案均通过执行环境与测试套件进行验证 - **正确性保障**:仅收录经过验证的正确可用代码的问题 - **人工审核**:由编程专家对样本进行验证 - **自动化检查**:包括静态分析、代码检查与运行时验证(如适用) - **开源与闭源过滤**:整合专有工具与社区驱动工具以强化质量管控 ## 性能指标 基于本数据集训练的模型在以下方面表现出显著提升: - 带推理的代码生成准确性 - 生成详细逐步解决方案的效率 - 问题解决速度与逻辑连贯性 - 跨语言与跨领域的代码迁移能力 - 通过更优推理减少幻觉或错误代码输出 ## 致谢 特别感谢我们的合作伙伴与贡献者: - **NVIDIA**:提供参考数据集;CodeX包含大量从NVIDIA现有数据集提取的示例 - **XenArcAI团队**:数据集整理、质量保障与定制生成的示例 ## 引用 **任何个人或机构**均可自由使用与修改本数据集。 ## 许可证 本数据集采用 [apache-2.0] 许可证发布。 bibtex @dataset{codex2024, title={CodeX-2M-Thinking: Large-Scale Coding Dataset with Reasoning}, author={Parvesh at XenArcAI}, year={2024}, publisher={XenArcAI}, url={https://huggingface.co/datasets/XenArcAI/CodeX-2M-Thinking} } ## 联系方式 如有疑问、建议或合作意向: - **邮箱**:[XenArcAI](team@xenarcai.com) - **Twitter**:[@XenArcAI] - **GitHub**:[XenArcAI] --- *由XenArcAI倾心打造——以高质量数据推动AI发展*
提供机构:
maas
创建时间:
2025-11-17
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作