Towards Automatic Labeling of Exception Handling Bugs: A Case Study of 10 Years Bug-Fixing in Apache Hadoop
Source: DataCite Commons. Updated: 2024-04-29. Indexed: 2024-08-18.
Download link:
https://figshare.com/articles/dataset/Towards_Automatic_Labeling_of_Exception_Handling_Bugs_A_Case_Study_of_10_Years_Bug-Fixing_in_Apache_Hadoop/22735124
Description:
<b>Context:</b> Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risk). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters can seldom tag EH bugs, since doing so may require comprehensive knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.
<b>Objective:</b> First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness regarding the importance of EH bugs.
<b>Method:</b> We manually analyzed 4,516 bug reports from the four main components of the Apache Hadoop project, out of which we labeled ~20% (943) as EH bugs. We also labeled 2,584 additional reports as non-EH bugs by analyzing their bug-fixing code, yielding a dataset of 7,100 bug reports. Then, we used word embedding techniques (Bag-of-Words and TF-IDF) to represent the textual fields of bug reports. Subsequently, we used these embeddings to fit five classes of ML methods and evaluated them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields. Finally, we evaluated whether considering only EH keywords is enough to achieve high predictive performance.
<b>Results:</b> Our results show that a pre-trained DistilBERT with a linear layer trained on our proposed dataset can reasonably label EH bugs, achieving ROC-AUC scores of up to 0.88. The combination of traditional NLP and ML techniques achieved ROC-AUC scores of up to 0.74 and recall of up to 0.56. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. Considering ROC-AUC as the primary concern, for the majority of ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this can change based on other metrics (such as recall and precision) or ML methods (e.g., Random Forest).
<b>Conclusions:</b> To the best of our knowledge, this is the first study addressing the problem of automatic labeling of EH bugs. Based on our results, we conclude that ML techniques, especially transformer-based models, are promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmarking dataset, paving the way for follow-up works. Additionally, our findings can be used to build tools that help maintainers identify EH bugs during the triage process.
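To make the keyword-only sanity check and the ROC-AUC metric mentioned in the abstract more concrete, the following is a minimal pure-Python sketch. It is not the authors' pipeline: the toy reports, the keyword stems in `EH_STEMS`, the smoothed TF-IDF formula, and all function names are assumptions made for illustration, and a plain keyword score stands in for the trained ML methods.

```python
import math
import re
from collections import Counter

# Hypothetical toy reports (text, label); label 1 = EH bug, 0 = non-EH bug.
# Invented for illustration; NOT taken from the published dataset.
REPORTS = [
    ("NullPointerException thrown when DataNode restarts", 1),
    ("Uncaught IOException in catch block leaves stream open", 1),
    ("Improve logging format for NameNode metrics", 0),
    ("Wrong default value for block size in config", 0),
]

# Assumed keyword stems; the study's actual keyword list is not given here.
EH_STEMS = ("exception", "throw", "catch", "caught")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document (tf times smoothed idf)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty reports
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def keyword_score(vector):
    """Sum the TF-IDF weights of terms that contain an EH keyword stem."""
    return sum(
        weight for term, weight in vector.items()
        if any(stem in term for stem in EH_STEMS)
    )


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


texts = [text for text, _ in REPORTS]
labels = [label for _, label in REPORTS]
scores = [keyword_score(vector) for vector in tfidf(texts)]
print(f"keyword-only ROC-AUC on the toy data: {roc_auc(labels, scores):.2f}")
```

On real bug reports, the scores passed to `roc_auc` would more plausibly be the predicted probabilities of a classifier fitted on the Bag-of-Words or TF-IDF vectors; the four toy reports are trivially separable and say nothing about the figures reported above.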
Provider:
figshare
Created:
2023-05-03



