Enabling Architecture Traceability by LLM-based Architecture Component Name Extraction

Name: Enabling Architecture Traceability by LLM-based Architecture Component Name Extraction
Creator: Open Research Knowledge Graph
Published: 2025-08-29 13:04:58
License: No description available

DataCite Commons. Updated: 2025-08-29. Indexed: 2026-05-04.

Download link:

https://orkg.org/paper/R1467890

Description:

Traceability Link Recovery (TLR) is an enabler for various software engineering tasks. One important task is the recovery of trace links between Software Architecture Documentation (SAD) and source code. Here, the main challenge is the semantic gap between the two artifact types. Recent research has shown that this semantic gap can be bridged by using Software Architecture Models (SAMs) as intermediates. However, creating SAMs is a manual and time-consuming task. This paper investigates the use of Large Language Models (LLMs) to extract component names as simple SAMs for TLR based on SAD and source code. By doing so, we aim to bridge the semantic gap between SAD and source code without the need for manual SAM creation. We compare our approach to the state-of-the-art TLR approaches TransArC and ArDoCode. TransArC is currently the best-performing approach for TLR between SAD and source code, but it requires SAMs as an additional artifact. Our evaluation shows that our approach performs comparably to TransArC (weighted average F1 with GPT-4o: 0.86 vs. TransArC’s 0.87), while needing only the SAD and source code. Moreover, our approach significantly outperforms the best baseline that does not need SAMs (weighted average F1 with GPT-4o: 0.86 vs. ArDoCode’s 0.62). In summary, our approach shows that LLMs can make TLR between SAD and source code more applicable by extracting component names, removing the need for manually created SAMs.
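The abstract reports results as a weighted average F1 across evaluation projects. The following is a minimal sketch of how such a figure could be computed, assuming each project's F1 is weighted by its number of gold-standard trace links; the abstract does not state the weighting scheme, so this is an assumption. The `ProjectResult` type, the project names, and all counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ProjectResult:
    """Hypothetical per-project trace link counts (not data from the paper)."""
    name: str
    true_positives: int   # recovered links that are in the gold standard
    false_positives: int  # recovered links that are not in the gold standard
    false_negatives: int  # gold-standard links that were not recovered

    @property
    def gold_links(self) -> int:
        # Number of links in the gold standard for this project.
        return self.true_positives + self.false_negatives

    @property
    def f1(self) -> float:
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        denominator = 2 * self.true_positives + self.false_positives + self.false_negatives
        return 0.0 if denominator == 0 else 2 * self.true_positives / denominator


def weighted_average_f1(results: list[ProjectResult]) -> float:
    """Average per-project F1, weighted by gold-standard link count (assumed weighting)."""
    total_links = sum(r.gold_links for r in results)
    if total_links == 0:
        return 0.0
    return sum(r.f1 * r.gold_links for r in results) / total_links


# Hypothetical example: two projects of different size.
results = [
    ProjectResult("project-a", true_positives=8, false_positives=2, false_negatives=2),
    ProjectResult("project-b", true_positives=30, false_positives=0, false_negatives=0),
]
score = weighted_average_f1(results)
print(f"{score:.2f}")
```

Under this weighting, larger projects contribute more to the overall score than they would in an unweighted (macro) average.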

Provider:

Open Research Knowledge Graph

Created:

2025-08-29
