An Empirical Study of Prompt Engineering in LLMs for User Story Generation

Name: An Empirical Study of Prompt Engineering in LLMs for User Story Generation
Creator: Zenodo
Published: 2026-05-03 23:55:09
License: No description available

Source: DataCite Commons; updated 2026-05-03; indexed 2026-05-07

Download link:

https://zenodo.org/doi/10.5281/zenodo.20017105

Resource description:

Large Language Models (LLMs) are increasingly used to support Requirements Engineering (RE) activities, particularly user story generation. However, it remains unclear how different LLMs behave when given the same prompting instructions for elicitation, especially regarding the consistency of the generated outputs (with the aim of minimizing hallucination) and how that consistency relates to the quality perceived by specialists. This gap leaves room for an exploratory assessment of these models as support for agile practices, where predictability and artifact quality are critical. In this context, this work analyzes the semantic consistency of user stories generated by three LLMs (ChatGPT 5 Instant, DeepSeek, and Gemini 2.5 Pro) across four domains, following the 3Cs structure for user story formulation. Textual consistency, treated as semantic consistency throughout the study, was evaluated automatically using text embeddings and cosine similarity between stories generated in different executions of the same protocol. In parallel, a controlled subset of the artifacts from two domains was selected for qualitative evaluation by two Software Engineering specialists, using criteria inspired by the Quality User Story (QUS) framework. The results indicate moderate to high semantic consistency (average similarity ranging from approximately 0.59 to 0.73), with variations across models and domains. Overall, the outputs of ChatGPT and DeepSeek were more stable, while Gemini varied more across executions, reflecting lower semantic uniformity. In the qualitative evaluation, the stories were considered suitable as a starting point but showed recurring weaknesses, particularly in atomicity and testability.
Finally, the results show that semantic similarity across generations and the quality perceived by specialists are not directly related, indicating that these dimensions capture distinct aspects of model behavior and reinforcing the importance of considering multiple perspectives when evaluating the use of LLMs in RE.
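
The consistency measure described above (cosine similarity between embeddings of stories produced by repeated executions of the same protocol) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in and not the embedding model used in the study, the three example stories are invented, and averaging over all pairs of runs is one plausible way to aggregate that the description does not confirm.

```python
import math
from collections import Counter
from itertools import combinations


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (NOT the study's embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(stories, embed=toy_embed):
    """Average cosine similarity over every pair of stories from repeated runs."""
    vectors = [embed(s) for s in stories]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two stories to compare")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs of the same prompt across three executions.
runs = [
    "As a customer I want to track my order so that I know when it arrives",
    "As a customer I want to see my order status so that I can plan delivery",
    "As a shopper I want order tracking so that I know the delivery date",
]
score = mean_pairwise_similarity(runs)
print(f"mean pairwise cosine similarity: {score:.2f}")
```

In practice `embed` would be replaced by a call to a real sentence-embedding model returning dense vectors; the pairwise averaging logic stays the same. A score near 1 indicates that repeated runs produce nearly identical stories, while a lower score indicates more variation across executions.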

Provider:

Zenodo

Created:

2026-05-03
