The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks

Name: The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks
Creator: figshare
Published: 2025-06-01 03:52:11
License: No description available

Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2025-05-07.

Download link:

https://figshare.com/articles/dataset/The_Self-Inflicted_Collapse_How_Recursive_Training_Undermines_Large_Language_Models_in_Automated_Software_Engineering_Tasks/28559318/1

Description:

Large Language Models (LLMs) have revolutionized natural language processing and are now integral to various automated software engineering tasks, such as code generation, vulnerability detection, and code summarization. However, the way these models are trained critically affects their long-term performance. In particular, recursive self-training, where models are continuously fine-tuned on data generated by their own outputs, poses a significant challenge, as it can lead to the gradual accumulation of errors and a phenomenon known as model collapse.

This paper, "The Self-Inflicted Collapse: How Recursive Training Undermines Large Language Models in Automated Software Engineering Tasks," investigates the impact of recursive training on LLMs. Our study leverages three well-known datasets:

HumanEval is used for the code generation task, providing a collection of programming problems with reference solutions to measure accuracy through the pass@1 metric.
CodeSearchNet serves the code summarization task, offering paired code snippets and human-written summaries, with performance evaluated using BLEU-4 scores.
ReVeal Dataset is employed for the vulnerability detection task, containing annotated smart contract code and detailed vulnerability reports, with performance assessed via the F1 score.

We benchmark six models (ChatGPT 4o, ChatGPT 4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, and Llama 3.2) across these tasks. First, baseline performance is established by fine-tuning each model exclusively on high-quality human-generated data. Then, we simulate a recursive training scenario in which the models are continuously fine-tuned on their own generated outputs over 10 generations. Performance is monitored through various metrics, including pass@1, F1 score, BLEU-4, and perplexity, to capture how recursive self-training affects each model's predictive capability.

Our experimental results reveal a consistent pattern of performance degradation when models are trained solely on their own outputs. As the generations progress, key metrics decline and perplexity increases, providing quantitative evidence of model collapse. This study highlights the risks associated with recursive self-training and underscores the need for improved training paradigms to maintain the robustness of LLMs in automated software engineering applications.
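
Illustrative sketch (not part of the dataset): the description above outlines a loop in which a model is repeatedly refit on its own outputs for 10 generations while perplexity is tracked. The toy below mimics that loop with a simple token-frequency model instead of an LLM. It is not the authors' experimental code; the Zipf-like "human" distribution, the function names (`fit`, `perplexity`, `simulate`), and all parameter values are assumptions chosen only for illustration.

```python
import math
import random
from collections import Counter


def fit(samples, vocab):
    """Stand-in for fine-tuning: maximum-likelihood token probabilities."""
    counts = Counter(samples)
    total = len(samples)
    return {tok: counts[tok] / total for tok in vocab}


def perplexity(model, reference, floor=1e-6):
    """exp(cross-entropy) of `model` measured against the reference
    distribution; zero probabilities are floored to keep the log finite."""
    cross_entropy = -sum(
        p * math.log(max(model[tok], floor)) for tok, p in reference.items()
    )
    return math.exp(cross_entropy)


def simulate(generations=10, sample_size=200, vocab_size=50, seed=0):
    """Return a list of (generation, support_size, perplexity) tuples."""
    rng = random.Random(seed)
    vocab = list(range(vocab_size))

    # Assumed "human" data: a Zipf-like distribution with a long tail.
    weights = [1.0 / (rank + 1) for rank in vocab]
    norm = sum(weights)
    human = {tok: w / norm for tok, w in zip(vocab, weights)}

    def record(gen, model):
        support = sum(1 for p in model.values() if p > 0)
        return (gen, support, perplexity(model, human))

    model = dict(human)  # generation 0: fit on human data only
    history = [record(0, model)]

    for gen in range(1, generations + 1):
        # The model generates its own training data...
        samples = rng.choices(
            vocab, weights=[model[tok] for tok in vocab], k=sample_size
        )
        # ...and is refit exclusively on that data.
        model = fit(samples, vocab)
        history.append(record(gen, model))

    return history


if __name__ == "__main__":
    for gen, support, ppl in simulate():
        print(f"generation {gen:2d}: support={support:2d} perplexity={ppl:.2f}")
```

In this toy, a token that receives zero probability can never be sampled again, so the set of tokens the model can produce only shrinks across generations, while perplexity against the original distribution ends up above the generation-0 value. This is a simplified analogue of the degradation described above, not a reproduction of the reported results.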

Provider:

figshare

Created:

2025-03-08
