Deconstructing Model Collapse in Software Engineering Tasks: A Multi-Granularity Empirical Study with Open-Source LLMs
Source: DataCite Commons · Updated: 2026-01-19 · Indexed: 2026-04-25
Download link:
https://figshare.com/articles/dataset/The_Self-Inflicted_Collapse_How_Recursive_Training_Undermines_Large_Language_Models_in_Automated_Software_Engineering_Tasks/28559318/6
Description:
<b>Large Language Models</b> (LLMs) have become indispensable for automated software engineering (SE) tasks such as code generation, vulnerability detection, and code summarization. Yet their long-term robustness is strongly shaped by training methodology. A particularly risky practice is recursive self-training, where models are repeatedly fine-tuned on their own generated outputs. While this strategy is often adopted to compensate for scarce human-annotated data, it carries the danger of <b>model collapse</b>: a degenerative process in which output quality, diversity, and reliability degrade across generations.

This paper provides the first multi-granularity empirical study of model collapse in SE tasks. Using open-source LLMs, namely LLaMA-3 (1B, 3B, 8B, 70B) [1], LLaMA-4 Scout (17B MoE) [2], and Qwen-3 (0.6B, 1.7B, 8B, 14B, 30B) [3], we design controlled recursive training experiments across three benchmarks:

- HumanEval: code generation, evaluated with Pass@1 and BLEU-4
- ReVeal: vulnerability detection, evaluated with F1/precision/recall
- CodeSearchNet: code summarization, evaluated with BLEU-4 and ROUGE-L

Models are trained under three regimes (real-only, synthetic-only, and hybrid) for up to ten recursive generations. We then analyze collapse dynamics at multiple granularities: task-level degradation, data distribution drift (perplexity/entropy), and mitigation effectiveness.

Our findings show that synthetic-only recursive training leads to sharp degradation, especially in smaller models, while hybrid strategies and quality filtering significantly slow collapse but cannot eliminate it entirely.
These results demonstrate that collapse in SE is not a simple extension of language collapse but a domain-specific phenomenon, driven by the structural and security-critical nature of code.

This study contributes:

- A systematic framework to reveal model collapse in SE tasks
- Empirical evidence across model scales, training regimes, and tasks
- Validated mitigation strategies (hybrid training, filtering, and diversity preservation) with practical implications for building stable LLM pipelines in software engineering

🔎 Reproducibility: All details necessary to reproduce our experiments, including preprocessing, hyperparameters, prompt templates, recursive training design, evaluation metrics, and mitigation strategies, are provided in the <b>README</b>.
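
To make the three training regimes and the entropy-based drift measure concrete, the following is a minimal Python sketch, not the authors' implementation. The helper names `build_training_set` and `token_entropy`, the 50% default mixing ratio for the hybrid regime, and the whitespace tokenization are all assumptions introduced for illustration; the README accompanying the dataset remains the authoritative description of the actual design.

```python
import math
import random
from collections import Counter

REGIMES = ("real-only", "synthetic-only", "hybrid")


def build_training_set(regime, real, synthetic, real_fraction=0.5, seed=0):
    """Assemble the fine-tuning data for one generation (hypothetical helper).

    real:      human-written examples
    synthetic: outputs produced by the previous generation's model
    real_fraction is an assumed mixing ratio, used only by the hybrid regime.
    """
    if regime not in REGIMES:
        raise ValueError(f"unknown regime: {regime}")
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    if regime == "real-only":
        return list(real)
    if regime == "synthetic-only":
        return list(synthetic)
    # Hybrid: draw from both pools without replacement, then shuffle.
    rng = random.Random(seed)
    size = min(len(real), len(synthetic))
    n_real = round(size * real_fraction)
    mixed = rng.sample(list(real), n_real)
    mixed += rng.sample(list(synthetic), size - n_real)
    rng.shuffle(mixed)
    return mixed


def token_entropy(samples):
    """Shannon entropy in bits of the whitespace-token distribution."""
    counts = Counter(tok for s in samples for tok in s.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy illustration: repeated synthetic outputs carry less token diversity.
real = ["def add(a, b): return a + b", "def sub(a, b): return a - b"]
synthetic = ["def add(a, b): return a + b", "def add(a, b): return a + b"]
print(token_entropy(real) > token_entropy(synthetic))  # True
```

In a full recursive experiment, each generation's model would be fine-tuned on the output of `build_training_set` and then sampled to produce the next generation's `synthetic` pool. Tracking `token_entropy` across generations is one simple way to observe the kind of diversity loss the description refers to as distribution drift.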
Provider:
figshare
Created:
2025-09-12



