Replication Package of "A Catalog of Data Smells for Coding Tasks"

Name: Replication Package of "A Catalog of Data Smells for Coding Tasks"
Creator: figshare
Published: 2025-06-01 05:12:49
License: No description available

DataCite Commons: updated 2025-06-01, indexed 2025-01-06

Download link:

https://figshare.com/articles/dataset/Replication_Package_of_A_Catalog_of_Data_Smells_for_Coding_Tasks_/25898650/1

Description:

Large Language Models (LLMs) are increasingly becoming fundamental in supporting software developers in coding tasks. The massive datasets used for training LLMs are often collected automatically, leading to the introduction of data smells. Previous work addressed this issue by using quality filters to handle some specific smells. Still, the literature lacks a systematic catalog of the currently known data smells for coding tasks. This paper presents a Systematic Literature Review (SLR) focused on articles that introduce LLMs for coding tasks. We first extracted the quality filters adopted for training and testing such LLMs, inferred the root problem behind their adoption (data smells for coding tasks), and defined a taxonomy of such smells. Our results highlight discrepancies in the adoption of quality filters between pre-training and fine-tuning stages and across different coding tasks, shedding light on areas for improvement in LLM-based software development support.

Provider:

figshare

Created:

2024-11-26
