CodeX-2M-Thinking
收藏魔搭社区2026-05-15 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/XenArcAI/CodeX-2M-Thinking
下载链接
链接失效反馈官方服务:
资源简介:
# Modotte
---
<p align="center">
<img
src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/ZP4YDDIRewH5M-jKmE4Rt.png"
alt="CodeX Banner"
width="70%"
style="border-radius:15px;"
/>
> Note: This dataset is part of the lineup CodeX by Modotte. You can get lots of datasets in this same lineup, with the main focus on providing very high-quality datasets for model training and fine-tuning.
This dataset is fully synthetic, curated from high-quality public sources and enhanced with synthetic data generated using both closed and open-source models. It serves as a strong foundation for instruction-based model tuning and fine-tuning, offering one of the most refined and extensive corpora available for coding tasks with reasoning.
### Key Features
- **Scale**: 2 million examples of highly curated coding data
- **Diversity**: Comprehensive coverage of programming domains from basic syntax to advanced software engineering
- **Quality**: Multi-stage filtering and verification processes, including ranking-based filtering and expert selections
- **Thinking Focus**: Step-by-step reasoning included in responses, optimized for instruction training with detailed thought processes
- **Accuracy**: Verified code executions and correctness validation using automated testing frameworks
## Dataset Overview
**CodeX-2M-Thinking** is a meticulously curated coding dataset designed specifically for instruction-based model tuning and fine-tuning of existing models with enhanced code generation and reasoning capabilities. This fully synthetic dataset represents a large and comprehensively filtered corpus of coding data on the Hugging Face platform, emphasizing a thinking approach with step-by-step reasoning for deeper model training.
## How to Use?
```bash
pip install -U datasets fsspec
```
```python
from datasets import load_dataset
dataset = load_dataset("Modotte/CodeX-2M-Thinking")
```
### Key Features
- **Scale**: 2 million examples of highly curated coding data
- **Diversity**: Comprehensive coverage of programming domains from basic syntax to advanced software engineering
- **Quality**: Multi-stage filtering and verification processes, including ranking-based filtering and expert selections
- **Thinking Focus**: Step-by-step reasoning included in responses, optimized for instruction training with detailed thought processes
- **Accuracy**: Verified code executions and correctness validation using automated testing frameworks
## Data Curation Process
This dataset has been carefully constructed through a fully synthetic approach, selectively generating and merging examples to enrich the overall dataset for generation models.
### Data Sources
- **High-Quality Existing Datasets**: Curated from multiple premium coding datasets available online (e.g., from NVIDIA and Modotte's internal collections)
- **Synthetic Generation**: Fully generated using both closed-source and open-source language models (Modotte)
- **Expert Validation**: Human-verified code solutions, reasoning, and implementations (Modotte)
### Filtering Pipeline
Our rigorous filtering process includes open and closed-source filtering techniques, ensuring only the highest-quality examples are retained:
1. **Deduplication**: Removal of duplicate problems and code solutions
2. **Normalization**: Code formatting standardization and syntax cleanup
3. **Stopword Processing**: Intelligent removal of non-essential comments or boilerplate
4. **Quality Scoring**: Multi-dimensional quality assessment using metrics like code complexity, readability, and efficiency
5. **Ranking-Based Filtering**: Advanced ranking algorithms to prioritize top-tier examples based on relevance, novelty, and utility
6. **Expert Selections**: Manual curation by coding experts to select exemplary samples
7. **Answer Verification**: Automated testing and execution validation using frameworks like pytest or unit tests
8. **Content Filtering**: Removal of inappropriate, outdated, or incorrect code
9. **Diversity Balancing**: Ensuring balanced representation across languages and domains through algorithmic sampling
### Problem Complexity Distribution
- **Basic Level** (30%): Fundamental programming concepts, simple syntax, and basic operations
- **Intermediate Level** (30%): Multi-function problems requiring modular code and basic algorithms
- **Advanced Level** (40%): Complex challenges involving data structures, optimization, and system design
### Programming Domains Covered
- Algorithms and Data Structures
- Web Development and Frameworks
- Machine Learning and AI Implementations
- System Programming and Operating Systems
- Database Management and SQL/NoSQL
- Software Engineering Best Practices
- Competitive Programming Problems
> Note: Domains are for reference only. The actual data is very diverse and covers more domains than stated. The actual data includes more complex and high-level questions than stated, spanning multiple programming languages such as Python, Java, C++, JavaScript, and others.
## Use Cases
- **Fine-tuning** code generation and reasoning capabilities in language models
- **Training** instruction-following models with a coding and reasoning focus
- **Benchmarking** model performance on coding tasks, problem-solving, and logical reasoning
- **Research** in AI-assisted programming, automated code completion, and explainable AI
- **Educational** applications requiring step-by-step code explanations and reasoning
## Dataset Format
Each example contains:
- **Problem Statement**: Clear coding challenge or task description
- **Step-by-Step Solution**: Detailed reasoning process
- **Code Solution**: Final executable code with integrated reasoning
## Quality Assurance
- **Automated Verification**: All code solutions verified using execution environments and testing suites
- **Correctness Guarantee**: Only problems with verified correct and functional code are included
- **Human Review**: Sample validation by coding experts
- **Automated Checks**: Static analysis, linting, and runtime verification where applicable
- **Open and Closed-Source Filtering**: Integration of proprietary and community-driven tools for enhanced quality control
## Performance Metrics
Models trained on this dataset show significant improvements in:
- Code generation accuracy with reasoning
- Efficiency in producing detailed, step-by-step solutions
- Problem-solving speed and logical coherence
- Cross-language and cross-domain code transfer
- Reduction in hallucinated or erroneous code outputs through better reasoning
## Acknowledgments
Special thanks to our partners and contributors:
- **NVIDIA** - Reference datasets; CodeX contains many examples taken from NVIDIA's existing datasets
- **Modotte Team** - Dataset curation, quality assurance, along with customly generated examples
## Citation
**Anyone** can freely use and modify this dataset.
## License
This dataset is released under [apache-2.0].
```bibtex
@dataset{codex2024,
title={CodeX-2M-Thinking: Large-Scale Coding Dataset with Reasoning},
author={Parvesh Rawal at Modotte},
year={2024},
publisher={Modotte},
url={https://huggingface.co/datasets/Modotte/CodeX-2M-Thinking}
}
```
## Contact
For questions, suggestions, or collaboration opportunities:
- **Email**: [Modotte](team@modotte.com)
- **Twitter**: [@Modotte]
- **GitHub**: [Modotte]
---
*Built with ❤️ by Modotte - Advancing AI through high-quality data*
# XenArcAI
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/ZP4YDDIRewH5M-jKmE4Rt.png" alt="CodeX 横幅" width="70%" style="border-radius:15px;" />
</p>
> 注:本数据集是XenArcAI推出的CodeX系列的组成部分。您可在该系列中获取多款数据集,其核心目标是为模型训练与微调提供极高质量的数据集。
本数据集完全由合成数据构建,从优质公开数据源精选而来,并结合闭源与开源大语言模型(Large Language Model)生成的合成数据进行增强。它是面向指令式模型微调的坚实基础,提供了当前可用的、针对带推理的编码任务的最精细且最全面的语料库之一。
### 核心特性
- **规模**:200万条经过精心筛选的高质量编码数据样本
- **多样性**:全面覆盖从基础语法到高级软件工程的各类编程领域
- **质量**:采用多阶段过滤与验证流程,包括基于排序的筛选与专家遴选
- **思维聚焦**:响应中包含逐步推理过程,针对带有详细思考流程的指令训练进行了优化
- **准确性**:通过自动化测试框架验证代码执行与正确性
## 数据集概览
**CodeX-2M-Thinking** 是一款经过精心整理的编码数据集,专为提升模型代码生成与推理能力的指令式模型微调而设计。这款完全合成的数据集是Hugging Face(拥抱脸)平台上规模庞大且经过全面过滤的编码语料库,强调通过逐步推理的思维方式实现更深入的模型训练。
## 使用方法
bash
pip install -U datasets fsspec
python
from datasets import load_dataset
dataset = load_dataset("XenArcAI/CodeX-2M-Thinking")
### 核心特性
- **规模**:200万条经过精心筛选的高质量编码数据样本
- **多样性**:全面覆盖从基础语法到高级软件工程的各类编程领域
- **质量**:采用多阶段过滤与验证流程,包括基于排序的筛选与专家遴选
- **思维聚焦**:响应中包含逐步推理过程,针对带有详细思考流程的指令训练进行了优化
- **准确性**:通过自动化测试框架验证代码执行与正确性
## 数据整理流程
本数据集通过完全合成的方式精心构建,通过选择性生成与合并样本以丰富面向生成模型的整体数据集。
### 数据来源
- **优质现有数据集**:从线上多个优质编码数据集精选而来(例如来自NVIDIA与XenArcAI内部馆藏的数据集)
- **合成生成**:通过闭源与开源大语言模型(Large Language Model)完全生成
- **专家验证**:由XenArcAI团队对代码解决方案、推理过程与实现进行人工审核
### 过滤流程
我们的严格过滤流程结合了开源与闭源过滤技术,仅保留最高质量的样本:
1. **去重**:移除重复的问题与代码解决方案
2. **标准化**:统一代码格式并清理语法问题
3. **停用词处理**:智能移除非必要注释或样板代码
4. **质量评分**:通过代码复杂度、可读性与效率等指标进行多维度质量评估
5. **基于排序的筛选**:采用高级排序算法,根据相关性、新颖性与实用性优先筛选顶级样本
6. **专家遴选**:由编程专家进行手动整理,挑选优质示例
7. **答案验证**:使用pytest等测试框架进行自动化测试与执行验证
8. **内容过滤**:移除不当、过时或错误的代码
9. **多样性平衡**:通过算法采样确保各编程语言与领域的数据分布均衡
### 问题复杂度分布
- **基础级(30%)**:基础编程概念、简单语法与基础操作
- **进阶级(30%)**:需要模块化代码与基础算法的多功能问题
- **高级(40%)**:涉及数据结构、优化与系统设计的复杂挑战
### 覆盖的编程领域
- 算法与数据结构
- Web开发与框架
- 机器学习与AI实现
- 系统编程与操作系统
- 数据库管理与SQL/NoSQL
- 软件工程最佳实践
- 竞赛编程问题
> 注:上述领域仅作参考,实际数据多样性极强,覆盖的领域远超上述范围。实际数据包含比描述更复杂的高阶问题,支持Python、Java、C++、JavaScript等多种编程语言。
## 应用场景
- **微调**:语言模型的代码生成与推理能力
- **训练**:以编码与推理为核心的指令跟随模型
- **基准测试**:模型在编码任务、问题解决与逻辑推理方面的性能
- **研究**:AI辅助编程、自动代码补全与可解释AI领域的研究
- **教育**:需要逐步代码解释与推理过程的教学应用
## 数据集格式
每个样本包含:
- **问题描述**:清晰的编码挑战或任务说明
- **逐步解决方案**:详细的推理过程
- **代码解决方案**:集成了推理过程的最终可执行代码
## 质量保障
- **自动化验证**:所有代码解决方案均通过执行环境与测试套件进行验证
- **正确性保障**:仅收录经过验证的正确可用代码的问题
- **人工审核**:由编程专家对样本进行验证
- **自动化检查**:包括静态分析、代码检查与运行时验证(如适用)
- **开源与闭源过滤**:整合专有工具与社区驱动工具以强化质量管控
## 性能指标
基于本数据集训练的模型在以下方面表现出显著提升:
- 带推理的代码生成准确性
- 生成详细逐步解决方案的效率
- 问题解决速度与逻辑连贯性
- 跨语言与跨领域的代码迁移能力
- 通过更优推理减少幻觉或错误代码输出
## 致谢
特别感谢我们的合作伙伴与贡献者:
- **NVIDIA**:提供参考数据集;CodeX包含大量从NVIDIA现有数据集提取的示例
- **XenArcAI团队**:数据集整理、质量保障与定制生成的示例
## 引用
**任何个人或机构**均可自由使用与修改本数据集。
## 许可证
本数据集采用 [apache-2.0] 许可证发布。
bibtex
@dataset{codex2024,
title={CodeX-2M-Thinking: Large-Scale Coding Dataset with Reasoning},
author={Parvesh at XenArcAI},
year={2024},
publisher={XenArcAI},
url={https://huggingface.co/datasets/XenArcAI/CodeX-2M-Thinking}
}
## 联系方式
如有疑问、建议或合作意向:
- **邮箱**:[XenArcAI](team@xenarcai.com)
- **Twitter**:[@XenArcAI]
- **GitHub**:[XenArcAI]
---
*由XenArcAI倾心打造——以高质量数据推动AI发展*
提供机构:
maas
创建时间:
2025-11-17



