TheFinAI/ibm-project-codenet

Name: TheFinAI/ibm-project-codenet
Creator: TheFinAI
Published: 2026-04-10 03:44:50
License: no description provided

Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/TheFinAI/ibm-project-codenet


Description:

---
dataset_info:
  features:
  - name: Source
    dtype: string
  - name: Date
    dtype: int64
  - name: Text
    dtype: string
  - name: Token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 8122744210
    num_examples: 6366648
  download_size: 3707767805
  dataset_size: 8122744210
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
pretty_name: Project_CodeNet
size_categories:
- 1M<n<10M
task_categories:
- text-generation
language:
- code
license: other
---

# Project_CodeNet

## Overview

This dataset is constructed from the **Project CodeNet** corpus, consisting of competitive programming submissions collected from online judges.

We extract a large-scale code corpus designed for pretraining language models, with a focus on:

- clean executable code
- temporal metadata (submission time)
- minimal preprocessing to preserve the original distribution

---

## Dataset Statistics

- **Total samples:** ~6.37M
- **Total tokens:** ~3.06B
- **Average tokens per sample:** 480.44

### Token Length Distribution

- P50: 162 tokens
- P90: 679 tokens
- P95: 1035 tokens
- P99: 2702 tokens

---

## Construction

### Source

- Project CodeNet: https://github.com/IBM/Project_CodeNet

### Filtering Rules

We apply the following steps:

1. **Keep only Accepted submissions**
   - Removes incorrect or incomplete code.
2. **Deduplication at metadata level**
   - For each `(problem_id, user_id, language)`, keep the **last accepted submission**
   - This approximates the user's final solution
3. **No content-based deduplication**
   - Similar solutions across users are preserved
   - Reflects real-world submission distribution
4. **No balancing**
   - Language and temporal distributions are kept as-is

---

## Fields

Each sample contains:

| Field | Description |
|------|------------|
| `Source` | Dataset name (`Project_CodeNet`) |
| `Date` | Submission year |
| `Text` | Source code |
| `Token_count` | Token count computed using `tiktoken` |

---

## Tokenization

- Tokenizer: `tiktoken`
- Encoding: `cl100k_base`

---

## Distribution Characteristics

### Language Distribution

The dataset is highly skewed toward C++:

- C++ dominates (~60%)
- Python is the second largest (~23%)
- Other languages form a long tail

### Temporal Distribution

The dataset is heavily concentrated in recent years:

- Majority of samples from **2019–2020**
- Reflects real submission activity in CodeNet

---

## Important Notes

- This dataset preserves the **original submission distribution** of CodeNet.
- It is **not balanced** across languages or time.
- It is primarily composed of **competitive programming code**, which may differ from production software code.
- Some level of **near-duplicate solutions** exists due to similar problem-solving strategies.

---

## Intended Use

- Pretraining code language models
- Studying temporal evolution of programming patterns
- Benchmarking under real-world distribution settings

---

## Limitations

- Not representative of general software engineering code
- Strong bias toward:
  - competitive programming tasks
  - algorithmic problem solving
- Language and temporal imbalance

---

## License

Please refer to the original **Project CodeNet** dataset for licensing details.

---

## Citation

If you use this dataset, please cite Project CodeNet:

    @article{puri2021project,
      title={Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
      author={Puri, Ruchir and others},
      year={2021}
    }
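
The first two filtering rules (keep only Accepted submissions, then keep the last accepted submission per `(problem_id, user_id, language)`) can be sketched as below. This is a minimal illustration, not the authors' pipeline: the row keys `status`, `date`, and `submission_id`, and the helper name `select_final_accepted`, are assumptions chosen for the example, and `date` is assumed to be any value that sorts chronologically.

```python
def select_final_accepted(rows):
    """Keep the last Accepted submission per (problem_id, user_id, language).

    `rows` is an iterable of dicts. The keys used here are assumed for
    illustration and may not match the real metadata schema.
    """
    latest = {}
    for row in rows:
        # Rule 1: drop anything that was not accepted by the judge.
        if row["status"] != "Accepted":
            continue
        key = (row["problem_id"], row["user_id"], row["language"])
        kept = latest.get(key)
        # Rule 2: among accepted submissions with the same key, keep the newest.
        if kept is None or row["date"] >= kept["date"]:
            latest[key] = row
    # Rules 3 and 4 are no-ops: no content deduplication, no rebalancing.
    return list(latest.values())


if __name__ == "__main__":
    demo = [
        {"submission_id": "s1", "problem_id": "p1", "user_id": "u1",
         "language": "C++", "status": "Accepted", "date": 100},
        {"submission_id": "s2", "problem_id": "p1", "user_id": "u1",
         "language": "C++", "status": "Accepted", "date": 200},
        {"submission_id": "s3", "problem_id": "p1", "user_id": "u1",
         "language": "C++", "status": "Wrong Answer", "date": 300},
    ]
    print([r["submission_id"] for r in select_final_accepted(demo)])
```

Because the key includes the language, a user who solved the same problem in both C++ and Python contributes two samples, which is consistent with the card's statement that no content-based deduplication is applied.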

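
The figures in the Dataset Statistics section can be cross-checked and, given the `Token_count` column, recomputed with a few lines of Python. The sketch below is illustrative only: the card does not say which percentile definition was used, so the nearest-rank method here is an assumption, and `summarize` and `nearest_rank_percentile` are hypothetical helper names.

```python
import math


def nearest_rank_percentile(sorted_values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(token_counts):
    """Compute sample count, total, mean, and P50/P90/P95/P99 of token counts."""
    ordered = sorted(token_counts)
    total = sum(ordered)
    summary = {
        "samples": len(ordered),
        "total_tokens": total,
        "mean": total / len(ordered),
    }
    for p in (50, 90, 95, 99):
        summary[f"p{p}"] = nearest_rank_percentile(ordered, p)
    return summary


# Consistency check using only numbers stated in the card:
# num_examples * average tokens per sample should be close to the total.
implied_total_billions = 6_366_648 * 480.44 / 1e9

if __name__ == "__main__":
    print(round(implied_total_billions, 2))
    print(summarize([10, 20, 30, 40]))
```

The implied total of roughly 3.06 billion tokens agrees with the stated total. The mean (480.44) is about three times the median (162), which indicates a long right tail of very large submissions.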

Provider:

TheFinAI
