
TheFinAI/ibm-project-codenet

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TheFinAI/ibm-project-codenet
Description:
---
dataset_info:
  features:
  - name: Source
    dtype: string
  - name: Date
    dtype: int64
  - name: Text
    dtype: string
  - name: Token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 8122744210
    num_examples: 6366648
  download_size: 3707767805
  dataset_size: 8122744210
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
pretty_name: Project_CodeNet
size_categories:
- 1M<n<10M
task_categories:
- text-generation
language:
- code
license: other
---

# Project_CodeNet

## Overview

This dataset is constructed from the **Project CodeNet** corpus, consisting of competitive programming submissions collected from online judges.

We extract a large-scale code corpus designed for pretraining language models, with a focus on:

- clean executable code
- temporal metadata (submission time)
- minimal preprocessing to preserve the original distribution

---

## Dataset Statistics

- **Total samples:** ~6.37M
- **Total tokens:** ~3.06B
- **Average tokens per sample:** 480.44

### Token Length Distribution

- P50: 162 tokens
- P90: 679 tokens
- P95: 1035 tokens
- P99: 2702 tokens

---

## Construction

### Source

- Project CodeNet
  https://github.com/IBM/Project_CodeNet

### Filtering Rules

We apply the following steps:

1. **Keep only Accepted submissions**
   - Removes incorrect or incomplete code.
2. **Deduplication at metadata level**
   - For each `(problem_id, user_id, language)`, keep the **last accepted submission**
   - This approximates the user's final solution
3. **No content-based deduplication**
   - Similar solutions across users are preserved
   - Reflects real-world submission distribution
4. **No balancing**
   - Language and temporal distributions are kept as-is

---

## Fields

Each sample contains:

| Field | Description |
|-------|-------------|
| `Source` | Dataset name (`Project_CodeNet`) |
| `Date` | Submission year |
| `Text` | Source code |
| `Token_count` | Token count computed using `tiktoken` |

---

## Tokenization

- Tokenizer: `tiktoken`
- Encoding: `cl100k_base`

---

## Distribution Characteristics

### Language Distribution

The dataset is highly skewed toward C++:

- C++ dominates (~60%)
- Python is the second largest (~23%)
- Other languages form a long tail

### Temporal Distribution

The dataset is heavily concentrated in recent years:

- Majority of samples from **2019–2020**
- Reflects real submission activity in CodeNet

---

## Important Notes

- This dataset preserves the **original submission distribution** of CodeNet.
- It is **not balanced** across languages or time.
- It is primarily composed of **competitive programming code**, which may differ from production software code.
- Some level of **near-duplicate solutions** exists due to similar problem-solving strategies.

---

## Intended Use

- Pretraining code language models
- Studying temporal evolution of programming patterns
- Benchmarking under real-world distribution settings

---

## Limitations

- Not representative of general software engineering code
- Strong bias toward:
  - competitive programming tasks
  - algorithmic problem solving
- Language and temporal imbalance

---

## License

Please refer to the original **Project CodeNet** dataset for licensing details.

---

## Citation

If you use this dataset, please cite Project CodeNet:

    @article{puri2021project,
      title={Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
      author={Puri, Ruchir and others},
      year={2021}
    }
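The first two filtering rules under Construction (keep only Accepted submissions, then keep the last accepted submission per `(problem_id, user_id, language)`) can be illustrated with a minimal sketch. This is not the authors' pipeline: the row layout, the `status` and `date` field names, the `"Accepted"` label, and the function name `select_final_accepted` are all assumptions made for illustration, and ties on `date` are resolved in favor of the row that appears later in the input.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def select_final_accepted(rows: Iterable[Row]) -> List[Row]:
    """Keep the last accepted submission per (problem_id, user_id, language).

    Assumed (hypothetical) row keys: problem_id, user_id, language,
    status, date. `date` is any comparable value such as a Unix timestamp.
    """
    latest: Dict[Tuple[object, object, object], Row] = {}
    for row in rows:
        # Rule 1: drop everything that was not accepted by the judge.
        if row["status"] != "Accepted":
            continue
        # Rule 2: metadata-level deduplication on the triple below.
        key = (row["problem_id"], row["user_id"], row["language"])
        kept = latest.get(key)
        if kept is None or row["date"] >= kept["date"]:
            latest[key] = row
    # Rules 3 and 4 are deliberately no-ops: no content-based
    # deduplication and no rebalancing across languages or years.
    return list(latest.values())


if __name__ == "__main__":
    sample = [
        {"submission_id": "s1", "problem_id": "p1", "user_id": "u1",
         "language": "C++", "status": "Accepted", "date": 100},
        {"submission_id": "s2", "problem_id": "p1", "user_id": "u1",
         "language": "C++", "status": "Accepted", "date": 150},
    ]
    for row in select_final_accepted(sample):
        print(row["submission_id"])
```

Because the key includes `language`, a user who solved the same problem in both C++ and Python contributes two samples, which matches the card's statement that only metadata-level duplicates are collapsed.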

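The `Token_count` field and the figures under Dataset Statistics can be reproduced in spirit with the sketch below. It assumes `tiktoken` is installed for real counts; the helper names (`load_cl100k_encoder`, `count_tokens`, `summarize`) are hypothetical, and the nearest-rank percentile definition is an assumption, since the card does not say which percentile method was used.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def load_cl100k_encoder() -> Callable[[str], List[int]]:
    """Return the cl100k_base encode function (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the rest of the sketch runs without it

    return tiktoken.get_encoding("cl100k_base").encode


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> List[int]:
    """Token count per text, using whatever encoder is passed in."""
    return [len(encode(text)) for text in texts]


def nearest_rank(sorted_counts: Sequence[int], percent: int) -> int:
    """Nearest-rank percentile (an assumed definition) for an integer percent."""
    if not sorted_counts:
        raise ValueError("need at least one value")
    # Integer ceiling of percent * n / 100, clamped to at least rank 1.
    rank = max(1, (percent * len(sorted_counts) + 99) // 100)
    return sorted_counts[rank - 1]


def summarize(counts: Iterable[int]) -> Dict[str, float]:
    """Sample count, total, mean, and P50/P90/P95/P99 of token counts."""
    ordered = sorted(counts)
    stats: Dict[str, float] = {
        "samples": len(ordered),
        "total_tokens": sum(ordered),
        "mean": sum(ordered) / len(ordered),
    }
    for percent in (50, 90, 95, 99):
        stats[f"p{percent}"] = nearest_rank(ordered, percent)
    return stats
```

With `tiktoken` available, `summarize(count_tokens(texts, load_cl100k_encoder()))` gives the same kind of summary the card reports. As a consistency check on the published numbers, 6,366,648 samples at an average of 480.44 tokens gives roughly 3.06B tokens, which agrees with the stated total.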
Provider:
TheFinAI