Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10521908
Description:
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying instances as SATD or not) or categorizing SATD (labeling SATD instances by the type of debt they pertain to, such as requirements, design, code, or test). However, the performance of such approaches remains suboptimal, particularly for specific types of SATD such as test and requirement debt. This is mostly because the datasets used are extremely imbalanced.
In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.
To demonstrate the effectiveness of our approach, we compared it against several existing approaches:
Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [GitHub]
eXtreme Gradient Boosting + Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
eXtreme Gradient Boosting + Easy Data Augmentation (XGBoost+EDA) [GitHub]
MT-Text-CNN [GitHub]
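The two-step approach described above (binary identification first, then categorization of the instances flagged as SATD) can be sketched as follows. This is only a minimal illustration of the control flow: the function `two_step_classify` and the two keyword-based stand-ins are hypothetical and are not taken from `bilstm.py` or `bert.py`. In the study, the first step is a BiLSTM model and the second a BERT model.

```python
from typing import Callable, List

NOT_SATD = "Not-SATD"


def two_step_classify(
    texts: List[str],
    is_satd: Callable[[str], bool],
    satd_type: Callable[[str], str],
) -> List[str]:
    """Step 1 decides SATD vs. Not-SATD; step 2 labels only the SATD texts."""
    labels = []
    for text in texts:
        if is_satd(text):
            labels.append(satd_type(text))
        else:
            labels.append(NOT_SATD)
    return labels


# Toy stand-ins for the two models, for illustration only (assumptions).
def toy_identifier(text: str) -> bool:
    return any(key in text.lower() for key in ("todo", "fixme", "hack"))


def toy_categorizer(text: str) -> str:
    return "TES" if "test" in text.lower() else "C/D"


if __name__ == "__main__":
    sample = ["TODO: add unit test", "Returns the user id", "FIXME ugly hack"]
    print(two_step_classify(sample, toy_identifier, toy_categorizer))
```

Passing the two models in as callables keeps the pipeline independent of any particular framework, so a trained identifier or categorizer could be swapped in without changing the loop.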
Structure of the Replication Package:
Following the original dataset, this package provides four CSV files, one for each artifact type considered in this study. Each CSV file contains a text column and a class column. The class is either a specific type of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), or requirement debt (REQ), or Not-SATD.
├── SATD Keywords
│ ├── Keywords based on Source of Artifacts
│ │ ├── Code comment.txt
│ │ ├── Commit message.txt
│ │ ├── Issue section.txt
│ │ └── Pull section.txt
│ ├── Keywords based on Types of SATD
│ │ ├── code-design debt.txt
│ │ ├── documentation debt.txt
│ │ ├── requirement debt.txt
│ │ └── test debt.txt
├── src
│ ├── bert.py
│ ├── bilstm.py
│ └── preprocessing.py
├── data-augmentation-code_comments.csv
├── data-augmentation-commit_messages.csv
├── data-augmentation-issues.csv
├── data-augmentation-pull_requests.csv
└── Supplementary Material.docx
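To check how balanced the classes are in one of the CSV files listed above, a short script along these lines may help. It is a sketch under assumptions: the column name `class` is a guess based on the description of the files (the actual header may differ), and `class_distribution` is a hypothetical helper, not part of `src`.

```python
import csv
import os
from collections import Counter


def class_distribution(csv_path, label_column="class"):
    """Count how many rows carry each label in the given CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[label_column] for row in reader)


if __name__ == "__main__":
    files = [
        "data-augmentation-code_comments.csv",
        "data-augmentation-commit_messages.csv",
        "data-augmentation-issues.csv",
        "data-augmentation-pull_requests.csv",
    ]
    for name in files:
        # Skip files that are not present in the working directory.
        if os.path.exists(name):
            print(name, dict(class_distribution(name)))
```

If the header uses a different name for the label column, pass it through the `label_column` argument.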
Requirements:
glove
nltk
transformers
torch
tensorflow
keras
langdetect
inflect
inflection
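Before running the scripts in `src`, it can be useful to verify which of these requirements are importable in the current environment. The sketch below assumes that each Python package is imported under the name listed above. `glove` is left out because it may refer to pre-trained GloVe embedding files rather than an importable module (an assumption; the package does not say). `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed import names, taken from the requirements list (glove omitted).
REQUIRED = [
    "nltk",
    "transformers",
    "torch",
    "tensorflow",
    "keras",
    "langdetect",
    "inflect",
    "inflection",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```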
Project sources for each artifact are as follows:
Source code comment: ant, argouml, columba, emf, hibernate, jedit, jfreechart, jmeter, jruby, squirrel
Issue section: camel, chromium, gerrit, hadoop, hbase, impala, thrift
Pull section: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper
Commit message: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper
This dataset was augmented using the AugGPT technique. The original dataset can be downloaded from: https://github.com/yikun-li/satd-different-sources-data
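The general idea of balancing classes through paraphrase-based augmentation can be sketched as below. This is not the authors' augmentation code: AugGPT, as we understand it, rephrases existing samples with a large language model, and here that step is replaced by a hypothetical `paraphrase` callable so the example stays self-contained. `balance_by_paraphrase` and the balancing target (the size of the largest class) are likewise assumptions.

```python
from collections import defaultdict
from itertools import cycle
from typing import Callable, List, Tuple

Row = Tuple[str, str]  # (text, label)


def balance_by_paraphrase(
    rows: List[Row], paraphrase: Callable[[str], str]
) -> List[Row]:
    """Add paraphrased copies of minority-class texts until every class
    has as many rows as the largest class."""
    if not rows:
        return []
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(rows)
    for label, texts in by_label.items():
        source = cycle(texts)  # reuse originals in turn if needed
        for _ in range(target - len(texts)):
            balanced.append((paraphrase(next(source)), label))
    return balanced


if __name__ == "__main__":
    data = [
        ("needs refactoring", "C/D"),
        ("quick hack here", "C/D"),
        ("no tests yet", "TES"),
    ]

    # Placeholder paraphraser; a real one would call a language model.
    def mark(text: str) -> str:
        return text + " (rephrased)"

    for row in balance_by_paraphrase(data, mark):
        print(row)
```

A real paraphraser should produce varied wording for each call; the placeholder above only marks the text so that the added rows are easy to spot.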
Created:
2024-04-24



