On the role of data balancing for Machine Learning-based Code Smell Detection

Name: On the role of data balancing for Machine Learning-based Code Smell Detection
Creator: figshare
Published: 2020-08-27 03:47:16
License: No description available

DataCite Commons | Updated 2020-08-27 | Indexed 2024-07-27

Download link:

https://figshare.com/articles/On_the_role_of_data_balancing_for_Machine_Learning-based_Code_Smell_Detection/8247509/1

Description:

Code smells can compromise software quality in the long term by inducing technical debt. For this reason, many approaches for identifying these design flaws have been proposed over the last decade. Most of them are based on heuristics in which a set of metrics (e.g., code metrics, process metrics) is used to detect smelly code components. However, these techniques suffer from subjective interpretation, low agreement between detectors, and threshold dependence. To overcome these limitations, previous work applied Machine Learning techniques that can learn from previous datasets without needing any threshold definition. However, more recent work has shown that Machine Learning is not always suitable for code smell detection due to the highly unbalanced nature of the problem. In this study we investigate several approaches able to mitigate data unbalancing issues to understand their impact on ML-based code smell detection algorithms. Our findings highlight a number of limitations and open issues with respect to the usage of data balancing for ML-based code smell detection.

Provider:

figshare

Created:

2019-06-11
