CodeX-7M-Non-Thinking
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/XenArcAI/CodeX-7M-Non-Thinking
下载链接
链接失效反馈官方服务:
资源简介:
# XenArcAI
---
<p align="center">
<img
src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/ZP4YDDIRewH5M-jKmE4Rt.png"
alt="CodeX Banner"
width="70%"
style="border-radius:15px;"
/>
> Note: This dataset is part of the lineup CodeX by XenArcAI. You can get lots of datasets in this same lineup, with the main focus on providing very high-quality datasets for model training and fine-tuning.
This dataset is curated from high-quality public sources and enhanced with synthetic data from both closed and open-source models. It serves as a strong foundation for instruction-based model tuning and fine-tuning, offering one of the most refined and extensive corpora available for coding tasks.
### Key Features
- **Scale**: 7 million examples of highly curated coding data
- **Diversity**: Comprehensive coverage of programming domains from basic syntax to advanced software engineering
- **Quality**: Multi-stage filtering and verification processes, including ranking-based filtering and expert selections
- **Non-Thinking Focus**: Direct code solutions without step-by-step reasoning chains, optimized for efficient instruction training
- **Accuracy**: Verified code executions and correctness validation using automated testing frameworks
## Dataset Overview
**CodeX-7M-Non-Thinking** is a meticulously curated coding dataset designed specifically for instruction-based model tuning and fine-tuning of existing models with enhanced code generation capabilities. This represents one of the largest and most comprehensively filtered corpora of publicly available coding data on the Hugging Face platform, with a non-thinking approach that emphasizes direct, concise code outputs for rapid model training.
## How to Use?
```bash
pip install -U datasets fsspec
```
```python
from datasets import load_dataset
dataset = load_dataset("XenArcAI/CodeX-7M-Non-Thinking")
```
### Key Features
- **Scale**: 7 million examples of highly curated coding data
- **Diversity**: Comprehensive coverage of programming domains from basic syntax to advanced software engineering
- **Quality**: Multi-stage filtering and verification processes, including ranking-based filtering and expert selections
- **Non-Thinking Focus**: Direct code solutions without step-by-step reasoning chains, optimized for efficient instruction training
- **Accuracy**: Verified code executions and correctness validation using automated testing frameworks
## Data Curation Process
This dataset has been carefully constructed through a multi-source approach, selectively collecting and merging examples from premium sources, along with customly generated examples to enrich the overall dataset for generation models.
### Data Sources
- **High-Quality Existing Datasets**: Curated from multiple premium coding datasets available online (e.g., from NVIDIA, OpenAI-inspired repositories, and XenArcAI's internal collections)
- **Synthetic Generation**: Generated using both closed-source and open-source language models (XenArcAI)
- **Expert Validation**: Human-verified code solutions and implementations (XenArcAI)
### Filtering Pipeline
Our rigorous filtering process includes open and closed-source filtering techniques, ensuring only the highest-quality examples are retained:
1. **Deduplication**: Removal of duplicate problems and code solutions
2. **Normalization**: Code formatting standardization and syntax cleanup
3. **Stopword Processing**: Intelligent removal of non-essential comments or boilerplate
4. **Quality Scoring**: Multi-dimensional quality assessment using metrics like code complexity, readability, and efficiency
5. **Ranking-Based Filtering**: Advanced ranking algorithms to prioritize top-tier examples based on relevance, novelty, and utility
6. **Expert Selections**: Manual curation by coding experts to select exemplary samples
7. **Answer Verification**: Automated testing and execution validation using frameworks like pytest or unit tests
8. **Content Filtering**: Removal of inappropriate, outdated, or incorrect code
9. **Diversity Balancing**: Ensuring balanced representation across languages and domains through algorithmic sampling
### Problem Complexity Distribution
- **Basic Level** (30%): Fundamental programming concepts, simple syntax, and basic operations
- **Intermediate Level** (30%): Multi-function problems requiring modular code and basic algorithms
- **Advanced Level** (40%): Complex challenges involving data structures, optimization, and system design
### Programming Domains Covered
- Algorithms and Data Structures
- Web Development and Frameworks
- Machine Learning and AI Implementations
- System Programming and Operating Systems
- Database Management and SQL/NoSQL
- Software Engineering Best Practices
- Competitive Programming Problems
> Note: Domains are for reference only. The actual data is very diverse and covers more domains than stated. The actual data includes more complex and high-level questions than stated, spanning multiple programming languages such as Python, Java, C++, JavaScript, and others.
## Use Cases
- **Fine-tuning** code generation capabilities in language models
- **Training** instruction-following models with a coding focus
- **Benchmarking** model performance on coding tasks and problem-solving
- **Research** in AI-assisted programming and automated code completion
- **Educational** applications requiring direct code examples and solutions
## Dataset Format
Each example contains:
- **Problem Statement**: Clear coding challenge or task description
- **Code Solution**: Direct, response without intermediate reasoning
## Quality Assurance
- **Automated Verification**: All code solutions verified using execution environments and testing suites
- **Correctness Guarantee**: Only problems with verified correct and functional code are included
- **Human Review**: Sample validation by coding experts
- **Automated Checks**: Static analysis, linting, and runtime verification where applicable
- **Open and Closed-Source Filtering**: Integration of proprietary and community-driven tools for enhanced quality control
## Performance Metrics
Models trained on this dataset show significant improvements in:
- Code generation accuracy
- Efficiency in producing concise solutions
- Problem-solving speed
- Cross-language and cross-domain code transfer
- Reduction in hallucinated or erroneous code outputs
## Acknowledgments
Special thanks to our partners and contributors:
- **NVIDIA, Magpie-Align, Magpie-Align** - Reference datasets; CodeX contains many examples taken from their existing datasets
- **Microsoft** - Inspirational datasets and methodologies; CodeX includes adapted examples from Microsft-related repositories
- **XenArcAI Team** - Dataset curation, quality assurance, along with customly generated examples
## Citation
**Anyone** can freely use and modify this dataset.
## License
This dataset is released under [apache-2.0].
```bibtex
@dataset{codex2024,
title={CodeX-7M-Non-Thinking: Large-Scale Coding Dataset},
author={Parvesh at XenArcAI},
year={2024},
publisher={XenArcAI},
url={https://huggingface.co/datasets/XenArcAI/CodeX-7M-Non-Thinking}
}
```
## Contact
For questions, suggestions, or collaboration opportunities:
- **Email**: [XenArcAI](team@xenarcai.com)
- **Twitter**: [@XenArcAI]
- **GitHub**: [XenArcAI]
---
*Built with ❤️ by XenArcAI - Advancing AI through high-quality data*
# XenArcAI
---
<p align="center">
<img
src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/ZP4YDDIRewH5M-jKmE4Rt.png"
alt="CodeX 横幅"
width="70%"
style="border-radius:15px;"
/>
> 注:本数据集隶属于XenArcAI旗下的CodeX系列数据集产品线。该系列汇聚多款优质数据集,核心定位为模型训练与微调提供高标准的高质量数据资源。
本数据集从优质公开数据源精选整理,并结合闭源与开源模型生成的合成数据进行增强优化。其可为基于指令的模型微调提供坚实基础,是目前面向编码任务的最精细、最全面的语料库之一。
### 核心特性
- **规模**:700万条经过精细筛选的编码数据样本
- **多样性**:覆盖从基础语法到高级软件工程的全编程领域
- **质量**:采用多阶段过滤与验证流程,包含基于排序的筛选机制与专家人工遴选
- **非思考式聚焦**:直接提供代码解决方案,无需逐步推理过程,专为高效的指令训练优化
- **准确性**:通过自动化测试框架对代码执行与正确性进行验证
## 数据集概览
**CodeX-7M-Non-Thinking** 是一款精心打造的编码数据集,专为基于指令的模型微调以及提升现有模型的代码生成能力而设计。它是拥抱脸(Hugging Face)平台上公开可用的规模最大、筛选最全面的编码语料库之一,采用“非思考式”设计理念,强调直接、简洁的代码输出,以适配快速模型训练需求。
## 使用方法
bash
pip install -U datasets fsspec
python
from datasets import load_dataset
dataset = load_dataset("XenArcAI/CodeX-7M-Non-Thinking")
### 核心特性
- **规模**:700万条经过精细筛选的编码数据样本
- **多样性**:覆盖从基础语法到高级软件工程的全编程领域
- **质量**:采用多阶段过滤与验证流程,包含基于排序的筛选机制与专家人工遴选
- **非思考式聚焦**:直接提供代码解决方案,无需逐步推理过程,专为高效的指令训练优化
- **准确性**:通过自动化测试框架对代码执行与正确性进行验证
## 数据构建流程
本数据集通过多源方式精心构建,从优质数据源中精选并合并样本,同时结合自定义生成的样本来丰富数据集,以适配生成模型的训练需求。
### 数据来源
- **高质量现有数据集**:从线上多款优质编码数据集精选整理(例如来自NVIDIA、受OpenAI启发的开源仓库,以及XenArcAI内部数据集集合)
- **合成数据生成**:使用闭源与开源大语言模型(Large Language Model)生成
- **专家验证**:由编码专家人工核验代码解决方案与实现
### 过滤流程
我们采用严格的过滤流程,结合开源与闭源技术手段,确保仅保留最高质量的样本:
1. **去重**:移除重复的问题与代码解决方案
2. **标准化**:统一代码格式并清理语法错误
3. **冗余处理**:智能移除非必要的注释与样板代码
4. **质量评分**:通过代码复杂度、可读性与效率等多维度指标进行质量评估
5. **基于排序的筛选**:使用高级排序算法,根据相关性、新颖性与实用性优先筛选优质样本
6. **专家遴选**:由编码专家手动精选典型样本
7. **答案验证**:使用pytest等自动化测试框架进行代码执行与有效性验证
8. **内容过滤**:移除不当、过时或存在错误的代码
9. **多样性平衡**:通过算法采样确保各编程语言与领域的数据分布均衡
### 问题难度分布
- **基础级别(30%)**:基础编程概念、简单语法与基础操作
- **中级级别(30%)**:需要模块化代码与基础算法的多函数问题
- **高级级别(40%)**:涉及数据结构、优化与系统设计的复杂挑战
### 覆盖的编程领域
- 算法与数据结构
- Web开发与框架
- 机器学习与AI实现
- 系统编程与操作系统
- 数据库管理与SQL/NoSQL
- 软件工程最佳实践
- 竞赛编程题目
> 注:上述领域仅作参考,实际数据覆盖范围更广,包含更多未列出的领域。实际数据还涵盖更多复杂高阶问题,支持Python、Java、C++、JavaScript等多种编程语言。
## 应用场景
- **微调**:语言模型的代码生成能力微调
- **训练**:面向编码任务的指令跟随模型训练
- **基准测试**:评估模型在编码任务与问题解决中的性能
- **研究**:AI辅助编程与自动代码补全相关研究
- **教育**:需要直接代码示例与解决方案的教学场景
## 数据集格式
每条样本包含:
- **问题描述**:清晰的编码挑战或任务说明
- **代码解决方案**:直接的代码响应,不含中间推理过程
## 质量保障
- **自动化验证**:所有代码解决方案均通过执行环境与测试套件验证
- **正确性保障**:仅收录经验证具备正确可执行代码的问题
- **人工复核**:由编码专家对样本进行抽样验证
- **自动化检查**:按需执行静态分析、代码 lint 与运行时验证
- **开源与闭源过滤结合**:整合专有工具与社区驱动工具,强化质量管控
## 性能表现
使用本数据集训练的模型在以下方面表现出显著提升:
- 代码生成准确性
- 生成简洁解决方案的效率
- 问题解决速度
- 跨语言与跨领域的代码迁移能力
- 幻觉或错误代码输出的占比降低
## 致谢
特别感谢以下合作伙伴与贡献者:
- **NVIDIA、Magpie-Align、Magpie-Align**:提供参考数据集;CodeX数据集包含大量取自其现有数据集的样本
- **Microsoft**:提供灵感来源的数据集与方法论;CodeX数据集包含适配自微软相关仓库的示例
- **XenArcAI团队**:负责数据集整理、质量保障以及自定义生成样本的制作
## 引用声明
任何个人或组织均可自由使用与修改本数据集。
## 许可证
本数据集采用 [apache-2.0] 许可证发布。
bibtex
@dataset{codex2024,
title={CodeX-7M-Non-Thinking: Large-Scale Coding Dataset},
author={Parvesh at XenArcAI},
year={2024},
publisher={XenArcAI},
url={https://huggingface.co/datasets/XenArcAI/CodeX-7M-Non-Thinking}
}
## 联系方式
如有疑问、建议或合作意向,请通过以下渠道联系:
- **邮箱**:[XenArcAI](team@xenarcai.com)
- **Twitter**:[@XenArcAI]
- **GitHub**:[XenArcAI]
---
*由XenArcAI倾心打造——以高质量数据推动AI发展*
提供机构:
maas
创建时间:
2025-11-17



