
CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14281412
Description:
Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers frequently face the challenge of upgrading their codebases to adapt to new programming language versions, which is a tedious and time-consuming process. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting that they could also help automate code upgrades. Despite this promise, no benchmark exists for evaluating the code upgrade ability of LLMs, because distilling the code changes that relate to programming language evolution from the commit histories of real-world software repositories is a complex challenge.

In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade in Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7 to 23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features, and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build system upgrades. The dataset provides high-quality samples by filtering out irrelevant and noisy changes and by verifying that the upgraded code compiles. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.
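To make the notion of a fine-grained, method-level upgrade concrete, the following is a minimal illustrative sketch of the kind of before/after pair such a dataset might contain. It is not taken from CoUpJava; the class name `UpgradeSketch` and both method names are hypothetical, and the chosen upgrade (a Java 7 style loop rewritten with Java 8 streams) is only one assumed example of a refactoring towards a newer language feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example for illustration only; not a sample from CoUpJava.
final class UpgradeSketch {

    // "Before": Java 7 style, an explicit loop that filters and transforms.
    static List<String> nonEmptyUpperLegacy(List<String> names) {
        List<String> result = new ArrayList<String>();
        for (String name : names) {
            if (!name.isEmpty()) {
                result.add(name.toUpperCase());
            }
        }
        return result;
    }

    // "After": Java 8+ style, the same behavior expressed with a stream,
    // a lambda, and a method reference.
    static List<String> nonEmptyUpperUpgraded(List<String> names) {
        return names.stream()
                .filter(name -> !name.isEmpty())
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ant", "", "maven");
        // Both versions should produce the same list for the same input.
        System.out.println(nonEmptyUpperLegacy(input));
        System.out.println(nonEmptyUpperUpgraded(input));
    }
}
```

A behavior-preserving pair like this is easy to check: both methods can be run on the same input and their results compared, which is one simple way to sanity-check an upgrade beyond confirming that it compiles.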
Created:
2024-12-05