TheFinAI/ibm-project-codenet
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TheFinAI/ibm-project-codenet
Dataset card:
---
dataset_info:
  features:
  - name: Source
    dtype: string
  - name: Date
    dtype: int64
  - name: Text
    dtype: string
  - name: Token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 8122744210
    num_examples: 6366648
  download_size: 3707767805
  dataset_size: 8122744210
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
pretty_name: Project_CodeNet
size_categories:
- 1M<n<10M
task_categories:
- text-generation
language:
- code
license: other
---
# Project_CodeNet
## Overview
This dataset is constructed from the **Project CodeNet** corpus, consisting of competitive programming submissions collected from online judges.
We extract a large-scale code corpus designed for pretraining language models, with a focus on:
- clean executable code
- temporal metadata (submission year, stored in `Date`)
- minimal preprocessing to preserve the original distribution
---
## Dataset Statistics
- **Total samples:** ~6.37M
- **Total tokens:** ~3.06B
- **Average tokens per sample:** 480.44
### Token Length Distribution
- P50: 162 tokens
- P90: 679 tokens
- P95: 1035 tokens
- P99: 2702 tokens
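
The headline figures are mutually consistent: 6,366,648 samples × 480.44 tokens per sample ≈ 3.06B tokens. The sketch below shows one way such a summary could be recomputed from the `Token_count` column. It assumes the nearest-rank percentile definition, which may differ from the method actually used for the numbers above; the helper names and the toy input are illustrative only.

```python
def percentile_nearest_rank(values, p):
    """Return the p-th percentile (integer p, 1..100) by the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100 gives the 1-based rank.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(token_counts):
    """Mean plus P50/P90/P95/P99 of a list of per-sample token counts."""
    summary = {"mean": sum(token_counts) / len(token_counts)}
    for p in (50, 90, 95, 99):
        summary[f"P{p}"] = percentile_nearest_rank(token_counts, p)
    return summary


# Toy input; real values would come from the `Token_count` column.
toy_counts = list(range(1, 101))
print(summarize(toy_counts))

# Sanity check on the published totals: samples x mean tokens, in billions.
print(round(6366648 * 480.44 / 1e9, 2))
```

The median (162) sits far below the mean (480.44), which points to a long right tail of very large submissions.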
---
## Construction
### Source
- Project CodeNet: https://github.com/IBM/Project_CodeNet
### Filtering Rules
We apply the following steps:
1. **Keep only Accepted submissions**
   - Removes incorrect or incomplete code.
2. **Deduplication at metadata level**
   - For each `(problem_id, user_id, language)`, keep the **last accepted submission**.
   - This approximates the user's final solution.
3. **No content-based deduplication**
   - Similar solutions across users are preserved.
   - Reflects the real-world submission distribution.
4. **No balancing**
   - Language and temporal distributions are kept as-is.
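
Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used to build the dataset: the row keys `status`, `date`, and `submission_id`, the comparison by timestamp, and the toy rows are all assumptions made for the example.

```python
def select_final_accepted(rows):
    """Keep the latest Accepted submission per (problem_id, user_id, language)."""
    latest = {}
    for row in rows:
        if row["status"] != "Accepted":
            continue  # step 1: drop everything that was not accepted
        key = (row["problem_id"], row["user_id"], row["language"])
        # step 2: a strictly later timestamp replaces the stored row for this key
        if key not in latest or row["date"] > latest[key]["date"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical metadata rows (field names and values are illustrative).
rows = [
    {"submission_id": "s1", "problem_id": "p1", "user_id": "u1", "language": "C++", "status": "Accepted", "date": 10},
    {"submission_id": "s2", "problem_id": "p1", "user_id": "u1", "language": "C++", "status": "Accepted", "date": 20},
    {"submission_id": "s3", "problem_id": "p1", "user_id": "u1", "language": "C++", "status": "Wrong Answer", "date": 30},
    {"submission_id": "s4", "problem_id": "p1", "user_id": "u2", "language": "C++", "status": "Accepted", "date": 5},
    {"submission_id": "s5", "problem_id": "p1", "user_id": "u1", "language": "Python", "status": "Accepted", "date": 15},
]
kept = select_final_accepted(rows)
print(sorted(r["submission_id"] for r in kept))
```

Because the key includes the language, the same user's C++ and Python solutions to one problem are both kept, and nothing here compares code content, which matches steps 3 and 4.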
---
## Fields
Each sample contains:
| Field | Description |
|------|------------|
| `Source` | Dataset name (`Project_CodeNet`) |
| `Date` | Submission year |
| `Text` | Source code |
| `Token_count` | Token count computed with `tiktoken` (`cl100k_base` encoding) |
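
As a sketch of how the schema above might be used, the snippet below checks a record against the four documented fields and, optionally, streams a few samples with the Hugging Face `datasets` library. The repository id is taken from the download link; `check_record` and `stream_samples` are hypothetical helper names, and the toy record (including its token count) is a placeholder, not real data.

```python
from itertools import islice

# Field names and Python types corresponding to the documented dtypes.
EXPECTED_FIELDS = {"Source": str, "Date": int, "Text": str, "Token_count": int}


def check_record(record):
    """Return a list of schema problems for one sample (empty list means OK)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems


def stream_samples(limit=3):
    """Yield the first `limit` samples without downloading the full dataset."""
    from datasets import load_dataset  # third-party: pip install datasets

    ds = load_dataset("TheFinAI/ibm-project-codenet", split="train", streaming=True)
    yield from islice(ds, limit)


# Placeholder record shaped like a dataset row.
toy = {"Source": "Project_CodeNet", "Date": 2020, "Text": "print(1)", "Token_count": 3}
print(check_record(toy))
```

Streaming avoids the roughly 3.7 GB download when only a handful of rows is needed; `stream_samples` requires network access and is therefore not called above.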
---
## Tokenization
- Tokenizer: `tiktoken`
- Encoding: `cl100k_base`
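
A `Token_count` value could be recomputed along the following lines. This is a sketch under assumptions: the card does not say how special-token strings inside source code were handled, so `disallowed_special=()` is a choice made here, and `count_tokens` and `cl100k_encoder` are illustrative names.

```python
def count_tokens(text, encode):
    """Number of tokens that `encode` produces for `text`."""
    return len(encode(text))


def cl100k_encoder():
    """Build an encode function backed by tiktoken's cl100k_base encoding."""
    import tiktoken  # third-party: pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() encodes strings such as "<|endoftext|>" as plain
    # text instead of raising an error (an assumption, not the card's method).
    return lambda text: enc.encode(text, disallowed_special=())


# The counter works with any encode callable; whitespace splitting is used
# here only so the example runs without tiktoken installed.
print(count_tokens("int main() { return 0; }", str.split))
```

With `tiktoken` installed, `count_tokens(source_code, cl100k_encoder())` gives the `cl100k_base` count for a code string.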
---
## Distribution Characteristics
### Language Distribution
The dataset is highly skewed toward C++:
- C++ dominates (~60%)
- Python is the second largest (~23%)
- Other languages form a long tail
### Temporal Distribution
The dataset is heavily concentrated in the most recent years of the corpus:
- The majority of samples are from **2019–2020**
- Reflects real submission activity in CodeNet
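
The temporal skew can be checked directly from the `Date` column. The sketch below computes the share of samples per year on toy values; `year_shares` is an illustrative helper name. The Fields table lists no language column, so the same approach does not carry over to the language split.

```python
from collections import Counter


def year_shares(dates):
    """Map each year to its fraction of all samples, in ascending year order."""
    counts = Counter(dates)
    total = sum(counts.values())
    return {year: counts[year] / total for year in sorted(counts)}


# Toy values standing in for the `Date` column.
print(year_shares([2018, 2019, 2019, 2020]))
```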
---
## Important Notes
- This dataset preserves the **original submission distribution** of CodeNet.
- It is **not balanced** across languages or time.
- It is primarily composed of **competitive programming code**, which may differ from production software code.
- Some **near-duplicate solutions** exist, because different users often apply similar problem-solving strategies and no content-based deduplication is applied.
---
## Intended Use
- Pretraining code language models
- Studying temporal evolution of programming patterns
- Benchmarking under real-world distribution settings
---
## Limitations
- Not representative of general software engineering code
- Strong bias toward:
  - competitive programming tasks
  - algorithmic problem solving
- Language and temporal imbalance
---
## License
Please refer to the original **Project CodeNet** dataset for licensing details.
---
## Citation
If you use this dataset, please cite Project CodeNet:
@article{puri2021project,
title={Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
author={Puri, Ruchir and others},
year={2021}
}
Provider:
TheFinAI



