
An Empirical Study of Next-Line Prediction in Build Systems Using CodeGen

Source: NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14052989
Resource description:
Build systems play a crucial role in software development and are responsible for compiling source code into executable programs. Despite their importance, build systems often receive limited attention because their impact is not directly visible to end users. This oversight can lead to inadequate maintenance, frequent build failures, and disruptions that require additional resources. Recognising and addressing the maintenance needs of build systems is essential to preventing costly disruptions and ensuring efficient software production.

In this paper, we explore whether applying a Large Language Model (LLM) can reduce the burden of maintaining build systems. We aim to determine whether the prior content in build specifications provides sufficient context for an LLM to generate subsequent lines accurately. We conduct an empirical study on CodeGen, a state-of-the-art LLM, using a dataset of 13,343 Maven build files. The dataset consists of the Expert dataset from the Apache Software Foundation (ASF) for fine-tuning (9,426 build files) and the Generalised dataset from GitHub for testing (3,917 build files).

We observe that (i) fine-tuning on a small portion of the data (i.e., 11% of the fine-tuning dataset) provides the largest improvement in performance, 13.93%, and (ii) when applied to the Generalised dataset, our fine-tuned model retains 83.86% of its performance, indicating that it is not overfitted. Upon further investigation, we classify build code content into functional and metadata subgroups based on enclosing tags. Our fine-tuned model performs substantially better at suggesting functional build code than metadata build code. Our findings highlight the potential of leveraging LLMs like CodeGen to relieve the maintenance challenges associated with build systems, particularly for functional content.

The study also highlights the limitations of large language models in suggesting the metadata components of build code. Future research should focus on developing approaches to enhance the accuracy and effectiveness of metadata generation.

Replication Package Structure:
dataset.zip: Contains the Expert and Generalised datasets.
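To make the next-line prediction task from the description concrete, the following is a minimal sketch of how a single Maven build file could be turned into (context, next line) pairs, where the context is all prior content and the target is the line that follows. The function name `next_line_pairs`, the sample POM, and the choice to skip blank lines and strip indentation from the target are illustrative assumptions. They are not taken from the replication package, whose actual preprocessing may differ.

```python
from typing import List, Tuple


def next_line_pairs(build_file_text: str) -> List[Tuple[str, str]]:
    """Split a build file into (context, target) pairs.

    Hypothetical helper, not from the replication package.
    Assumption: blank lines are dropped and the target line is
    stripped of indentation before comparison.
    """
    lines = [ln for ln in build_file_text.splitlines() if ln.strip()]
    pairs = []
    for i in range(1, len(lines)):
        context = "\n".join(lines[:i])   # everything before line i
        target = lines[i].strip()        # the line a model would predict
        pairs.append((context, target))
    return pairs


# Illustrative sample only; not drawn from the Expert or Generalised dataset.
SAMPLE_POM = """<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>demo</artifactId>
</project>"""

pairs = next_line_pairs(SAMPLE_POM)
for context, target in pairs:
    print(f"{len(context.splitlines())} context line(s) -> {target}")
```

Each pair could then be fed to a model as a prompt (the context) and a reference answer (the target). How predictions are scored against the target is not stated in this description, so the sketch stops at pair construction.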
Created: 2024-11-24