ibm-project-codenet

Name: ibm-project-codenet
Creator: The Fin AI
Published: 2026-04-10 11:44:50
License: Not specified

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-11.

Download link:

https://huggingface.co/datasets/TheFinAI/ibm-project-codenet
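A minimal sketch of reading a few records from this repository with the Hugging Face `datasets` library, assuming it is installed. The field names `Date` and `Token_count` are taken from the description on this page. The split name `"train"` and the helper names `stream_samples` and `summarize` are assumptions made for illustration.

```python
from collections import Counter
from itertools import islice

REPO_ID = "TheFinAI/ibm-project-codenet"  # repository id from the download link


def stream_samples(limit, split="train"):
    """Yield up to `limit` records without downloading the full dataset.

    Assumption: the repository exposes a split named "train".
    """
    # Imported lazily so the rest of the module works without the library.
    from datasets import load_dataset

    dataset = load_dataset(REPO_ID, split=split, streaming=True)
    return islice(dataset, limit)


def summarize(samples):
    """Return (samples per Date value, mean Token_count) for an iterable of records."""
    per_year = Counter()
    total_tokens = 0
    count = 0
    for sample in samples:
        per_year[sample["Date"]] += 1
        total_tokens += sample["Token_count"]
        count += 1
    mean_tokens = total_tokens / count if count else 0.0
    return dict(per_year), mean_tokens


# Usage (needs network access):
#   years, mean_tokens = summarize(stream_samples(1000))
```

Streaming avoids fetching all of the roughly 6.37 million samples just to inspect the year distribution or token counts of a small prefix.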

Description:

Project_CodeNet is a large-scale code corpus intended for pre-training language models. It is drawn mainly from competitive programming submissions collected from online judge systems. The dataset contains approximately 6.37 million samples totaling roughly 3.06 billion tokens, an average of 480.44 tokens per sample. Each sample has four fields: Source (dataset name), Date (submission year), Text (source code), and Token_count (token count). Only accepted submissions are retained, and for each combination of problem, user, and programming language only the last successful submission is kept, as an approximation of the user's final solution. The original submission distribution is preserved, with no content deduplication or balancing, so both the language and temporal distributions are heavily skewed: C++ accounts for approximately 60% of samples, Python for 23%, and most samples fall in 2019-2020. The dataset is suited to pre-training code language models, studying how programming patterns change over time, and benchmarking under a realistic distribution. Note that it consists mainly of competitive programming code, which differs from production software.
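The selection rule described above (accepted submissions only, last one per problem, user, and language) can be sketched as follows. This is an illustration under stated assumptions, not the curators' actual pipeline: the `Submission` fields, the `"Accepted"` status string, and the function name `final_accepted` are hypothetical, since the published samples expose only Source, Date, Text, and Token_count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Hypothetical raw-metadata fields, assumed for illustration only.
    problem_id: str
    user_id: str
    language: str
    status: str
    submitted_at: int  # any sortable timestamp
    text: str


def final_accepted(submissions):
    """Keep the latest accepted submission per (problem, user, language)."""
    latest = {}
    for sub in submissions:
        if sub.status != "Accepted":
            continue  # drop every non-accepted submission
        key = (sub.problem_id, sub.user_id, sub.language)
        kept = latest.get(key)
        if kept is None or sub.submitted_at > kept.submitted_at:
            latest[key] = sub
    return list(latest.values())
```

Because the key includes the language, a user who solved the same problem in both C++ and Python contributes two samples. Identical code from different users is also kept, which matches the statement that no content deduplication is applied.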

Provider:

The Fin AI

Created:

2026-04-10