Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt

Indexed from NIAID Data Ecosystem on 2026-05-01

Download link:

https://zenodo.org/record/10521908

Description:

Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research on detecting SATD has focused on either identifying SATD (classifying instances as SATD or not) or categorizing SATD (labeling SATD instances as pertaining to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly for specific types of SATD such as test and requirement debt, mostly because the datasets used are extremely imbalanced.

In this study, we use a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is used for the binary identification of SATD. The BERT architecture is then used to categorize the different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than that of baseline approaches.

To showcase the effectiveness of our approach, we compared it against several existing approaches:

- Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [Github]
- eXtreme Gradient Boosting + Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
- eXtreme Gradient Boosting + Easy Data Augmentation (XGBoost+EDA) [Github]
- MT-Text-CNN [Github]

Structure of the Replication Package:

In accordance with the original dataset, the dataset comprises four distinct CSV files, one for each artifact considered in this study.
Each CSV file contains a text column and a class column. The class denotes the type of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), or requirement debt (REQ), or marks the text as Not-SATD.

├── SATD Keywords
│   ├── Keywords based on Source of Artifacts
│   │   ├── Code comment.txt
│   │   ├── Commit message.txt
│   │   ├── Issue section.txt
│   │   └── Pull section.txt
│   ├── Keywords based on Types of SATD
│   │   ├── code-design debt.txt
│   │   ├── documentation debt.txt
│   │   ├── requirement debt.txt
│   │   └── test debt.txt
├── src
│   ├── bert.py
│   ├── bilstm.py
│   └── preprocessing.py
├── data-augmentation-code_comments.csv
├── data-augmentation-commit_messages.csv
├── data-augmentation-issues.csv
├── data-augmentation-pull_requests.csv
└── Supplementary Material.docx

Requirements:

- glove
- nltk
- transformers
- torch
- tensorflow
- keras
- langdetect
- inflect
- inflection

Project sources for each artifact are as follows:

Source code comment: ant, argouml, columba, emf, hibernate, jedit, jfreechart, jmeter, jruby, squirrel

Issue section: camel, chromium, gerrit, hadoop, hbase, impala, thrift

Pull section: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper

Commit message: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper

This dataset has undergone a data augmentation process using the AugGPT technique. The original dataset can be downloaded via the following link: https://github.com/yikun-li/satd-different-sources-data
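
As a rough illustration of how the CSV layout described above maps onto a two-step setup (binary identification first, then categorization of the SATD rows), the following minimal Python sketch reads text/label pairs, counts the labels, and derives the two training views. It is not the code in src/. The column names `text` and `class`, the label string `Not-SATD`, and the helper names `read_rows`, `class_distribution`, and `split_two_step` are all assumptions made for illustration; the actual headers and label spellings in the CSV files may differ.

```python
import csv
import io
from collections import Counter

# Assumed label for non-debt rows; the actual string in the CSV files may differ.
NOT_SATD = "Not-SATD"


def read_rows(handle, text_col="text", class_col="class"):
    """Read (text, label) pairs from an open CSV handle.

    The column names are assumptions; adjust them to the real headers.
    """
    reader = csv.DictReader(handle)
    return [(row[text_col], row[class_col]) for row in reader]


def class_distribution(rows):
    """Count how many rows carry each label, to check class balance."""
    return Counter(label for _, label in rows)


def split_two_step(rows, not_satd=NOT_SATD):
    """Derive the two training views for a two-step setup.

    Step 1 (identification): every row, with a binary target (1 = SATD).
    Step 2 (categorization): only SATD rows, keeping the debt type as target.
    """
    identification = [(text, int(label != not_satd)) for text, label in rows]
    categorization = [(text, label) for text, label in rows if label != not_satd]
    return identification, categorization


# Tiny made-up sample standing in for one of the CSV files.
sample = io.StringIO(
    "text,class\n"
    "TODO: refactor this hack,C/D\n"
    "need to add unit tests here,TES\n"
    "update the README later,DOC\n"
    "fix typo in variable name,Not-SATD\n"
)
rows = read_rows(sample)
dist = class_distribution(rows)
identification, categorization = split_two_step(rows)

print(dist)
print(identification)
print(categorization)
```

To try this on a real file, the in-memory sample could be replaced with a handle such as `open("data-augmentation-issues.csv", newline="", encoding="utf-8")`; the encoding is likewise an assumption. The identification view would feed a binary classifier and the categorization view a multi-class one.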

Created:

2024-04-24
