When AI-Generated Data Dominates: Model Collapse in Software Engineering Tasks
Source: Figshare | Updated: 2025-03-08 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/The_Self-Inflicted_Collapse_How_Recursive_Training_Undermines_Large_Language_Models_in_Automated_Software_Engineering_Tasks/28559318
Description:
Large Language Models (LLMs) have become indispensable for automated software engineering (SE) tasks, such as code generation, vulnerability detection, and code summarization. Yet their long-term robustness is strongly shaped by the composition of training data. A particularly risky practice arises when models are extensively updated using AI-generated software artifacts. While such data reuse is often adopted to compensate for scarce human-annotated data, it introduces the risk of model collapse: a degenerative process in which output quality, diversity, and reliability progressively degrade as exposure to synthetic data increases.

This paper provides the first multi-granularity empirical study of model collapse in SE tasks. Using open-source LLMs, including LLaMA-3 (1B, 3B, 8B, 70B)~\cite{Llama32}, LLaMA-4 Scout (17B MoE)~\cite{Llama4}, and Qwen-3 (0.6B, 1.7B, 8B, 14B, 30B)~\cite{Qwen3}, we design controlled model update experiments under evolving data compositions across three benchmarks:

- HumanEval: code generation, evaluated with Pass@1 and BLEU-4
- ReVeal: vulnerability detection, evaluated with F1, precision, and recall
- CodeSearchNet: code summarization, evaluated with BLEU-4 and ROUGE-L

Models are updated under three data regimes (real-only, synthetic-only, and hybrid) over up to ten successive exposure stages. We then analyze collapse dynamics at multiple granularities, including task-level performance degradation, data distribution drift (perplexity and entropy), and mitigation effectiveness.

Our findings show that exclusive reliance on AI-generated data leads to sharp performance degradation, particularly in smaller models, while hybrid data composition and quality-based filtering significantly slow collapse but cannot fully eliminate it.
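
The entropy side of the distribution-drift analysis can be illustrated with a minimal sketch. It assumes whitespace tokenization and two toy corpora invented for illustration; the helper name `token_entropy` is hypothetical, and the study's actual tokenizer and drift protocol are documented in its README, not here.

```python
import math
from collections import Counter


def token_entropy(samples):
    """Shannon entropy (in bits) of the whitespace-token distribution.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    counts = Counter(tok for s in samples for tok in s.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy corpora, fabricated for illustration only (not taken from the dataset).
real_stage = [
    "return a + b",
    "for x in items : yield x",
    "raise ValueError ( msg )",
]
synthetic_stage = [
    "return a + b",
    "return a + b",
    "return a + a",
]

real_h = token_entropy(real_stage)
synthetic_h = token_entropy(synthetic_stage)

# A lower value for the repetitive corpus signals reduced token diversity.
print(f"real: {real_h:.2f} bits, synthetic: {synthetic_h:.2f} bits")
```

Tracking such a value across successive exposure stages would give one coarse signal of shrinking diversity; perplexity would additionally require token probabilities from a reference model.
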
These results demonstrate that collapse in SE is not a simple extension of language-domain degradation, but a domain-specific phenomenon driven by the structural constraints, long-tail dependencies, and security-critical semantics of source code.

This study contributes:

- A systematic framework for analyzing model collapse in software engineering tasks
- Empirical evidence across model scales, data composition regimes, and SE task types
- Validated mitigation strategies (hybrid data anchoring, quality filtering, and diversity preservation) with practical implications for building stable LLM pipelines in software engineering

🔎 Reproducibility: All details required to reproduce our experiments, including preprocessing steps, hyperparameters, prompt templates, dataset composition strategies, evaluation protocols, and mitigation mechanisms, are provided in the accompanying README.
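
As a rough illustration of two of the mitigation ideas named above, hybrid data anchoring and quality filtering, the sketch below assembles the training batch for one update stage. Everything in it is an assumption made for illustration: the function names `build_stage_batch` and `compiles`, the `real_fraction` parameter, and the use of Python's built-in `compile` as a stand-in quality check. The study's actual mechanisms are described in its README.

```python
import random


def compiles(snippet):
    """Hypothetical quality score: 1.0 if the snippet parses as Python, else 0.0."""
    try:
        compile(snippet, "<synthetic>", "exec")
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def build_stage_batch(real, synthetic, real_fraction, size,
                      quality_fn=compiles, min_quality=1.0, seed=0):
    """Mix real samples with quality-filtered synthetic samples for one stage.

    real_fraction = 1.0 gives the real-only regime, 0.0 the synthetic-only
    regime, and anything in between a hybrid regime.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    rng = random.Random(seed)
    # Quality filtering: drop synthetic samples below the threshold.
    kept = [s for s in synthetic if quality_fn(s) >= min_quality]
    # Hybrid anchoring: reserve a fixed share of the batch for real data.
    n_real = min(round(size * real_fraction), len(real))
    n_syn = min(size - n_real, len(kept))
    batch = rng.sample(real, n_real) + rng.sample(kept, n_syn)
    rng.shuffle(batch)
    return batch


# Toy data, fabricated for illustration only.
real = [f"def real_{i}(): return {i}" for i in range(10)]
synthetic = ["def ok(): return 1", "def broken(:", "x = = 2", "y = 3"]

hybrid = build_stage_batch(real, synthetic, real_fraction=0.5, size=4)
real_only = build_stage_batch(real, synthetic, real_fraction=1.0, size=4)
```

A parse check is only one possible filter. A pipeline for vulnerability detection or summarization would need a task-appropriate quality signal, and diversity preservation is not covered by this sketch.
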
Created: 2025-03-08



