SE Stopwords
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7865747
Resource description:
Overview
This repository contains stopword lists specifically tailored for natural language processing (NLP) tasks applied to software development documents. It aims to enhance the efficiency and accuracy of NLP applications on various types of software documentation, including bug reports, commit messages, and API documentation.
Background and Motivation
Stop words, deemed non-predictive, are often eliminated in NLP tasks. However, the definition of uninformative vocabulary remains vague, leading most algorithms to use general knowledge-based stop lists. The effectiveness of stop word elimination, particularly in domain-specific settings, is debated among academics.
In a recent paper, we investigated the usefulness of stop word removal in a software engineering context. To achieve this, we replicated and experimented with three software engineering research tools from related work. A corpus of software engineering domain-related text was constructed from 10,000 Stack Overflow questions, and 200 domain-specific stop words were identified using traditional information-theoretic methods.
The results demonstrated that using domain-specific stop words significantly improved the performance of the research tools compared to a general stop list. With the TF-IDF-based SE domain list, 17 of the 19 evaluation measures performed better than the baseline.
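To make the idea of deriving stop word candidates from a corpus more concrete, here is a minimal sketch of one common TF-IDF-style ranking. It is an illustration only and not necessarily the exact procedure used in the paper: the whitespace tokenizer, the averaging over documents that contain the term, and the function name `rank_stopword_candidates` are all assumptions.

```python
import math
from collections import Counter


def rank_stopword_candidates(documents, top_n=200):
    """Return the top_n terms with the lowest average TF-IDF score.

    Assumption: a low average TF-IDF (a term spread evenly over many
    documents) is taken as a sign that the term is uninformative.
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Sum TF-IDF per term over the documents in which it occurs.
    totals = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            totals[term] += tf * idf

    # Average over the documents containing the term, then sort ascending.
    mean_score = {term: totals[term] / doc_freq[term] for term in doc_freq}
    return sorted(mean_score, key=lambda t: (mean_score[t], t))[:top_n]
```

A term that occurs in every document gets an IDF of zero and therefore ranks first, which matches the intuition that ubiquitous words carry little information.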
Comparison to Baseline across 19 Metrics
The table below summarizes the performance improvements when using different stopword lists compared to the baseline across 19 metrics.
Stop word list        Better  Worse  Same
SE Domain (TF-IDF)    17      1      1
SE Domain (Poisson)   12      5      2
Technology Domain     9       9      1
Large                 11      8      0
Medium                11      7      1
Small                 13      5      1
Very Small            10      7      2
No Stop Words         4       12     3
Usage Instructions
These stopword lists can be used to filter out uninformative words from software development documents, thereby improving the understanding and analysis of textual data in the software development domain.
To use these lists in your NLP tasks, simply import them into your project and apply them as filters during the pre-processing stage.
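As a rough illustration of the pre-processing step described above, the sketch below loads a stop word list and filters a piece of text. It assumes each list is a plain-text file with one word per line. The file name in the usage comment and the simple regex tokenizer are hypothetical, so adjust both to the actual files in `stopwords_lists` and to your own pipeline.

```python
import re
from pathlib import Path


def load_stopwords(path):
    """Read a stop word list, assuming one word per line (an assumption)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into simple word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return [token for token in tokens if token not in stopwords]


# Usage (hypothetical file name, replace with a real list from the repository):
# stopwords = load_stopwords("stopwords_lists/se_domain_tfidf.txt")
# tokens = remove_stopwords("The code is throwing an exception", stopwords)
```

Storing the list as a set keeps each membership check constant time, which matters when filtering large corpora such as issue trackers or commit histories.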
Folder Structure
SE-stopwords
|-- data_for_replications (contains all the required data for replicating software engineering tools)
|   |-- Maalej_Dataset (original data for app review tool)
|   `-- queries (queries used for RACKTool)
|-- stackoverflow_questions (more than 10k top reviewed questions on Stack Overflow)
|-- stopwords_lists (all the stoplists)
|-- replications
`-- stackoverflow (code for creating the domain-specific corpus)
Detailed Results for the Three Replicated Tools
The results may vary by a small fraction depending on the trial, but they should be approximately the same as the tables below.
Tool 1 (App Review)
                     PD (bug report)        RT (rating)            FR (feature request)   UE (user experience)
Stop word list       Pre    Rec    F1       Pre    Rec    F1       Pre    Rec    F1       Pre    Rec    F1
SE domain (Poisson)  10.0%  37.5%  15.8%    72.1%  78.0%  74.9%    7.1%   29.8%  11.5%    11.6%  32.0%  17.0%
SE domain (TF-IDF)   10.7%  40.2%  16.9%    72.2%  78.2%  75.1%    7.9%   33.3%  12.8%    11.7%  32.5%  17.2%
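The F1 values reported for Tool 1 are consistent with the usual harmonic mean of precision and recall. A minimal sketch for checking a row, assuming that standard definition (the function name `f1_score` is illustrative and not taken from the repository):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PD (bug report) with the SE domain (Poisson) list: Pre 10.0%, Rec 37.5%
pd_poisson_f1 = f1_score(0.100, 0.375)
print(f"{pd_poisson_f1:.1%}")  # prints 15.8%
```

Because the harmonic mean is dominated by the smaller value, the low precision in the PD, FR, and UE categories keeps their F1 scores low even where recall is moderate.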
Tool 2 (RACK)
Stop word list       Top-10   MRR@10   MAP@10   MR@K
SE domain (Poisson)  83.85%   52.29%   43.27%   54.47%
SE domain (TF-IDF)   84.17%   53.20%   45.82%   56.8%
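For readers unfamiliar with the ranking metrics reported for Tool 2, the sketch below shows the textbook definition of mean reciprocal rank with a cutoff of 10. It is an assumption that the replication computes MRR@10 in exactly this way, and the function names and toy data are hypothetical.

```python
def reciprocal_rank(ranked_results, relevant, k=10):
    """Return 1/rank of the first relevant item within the top k, else 0.0."""
    for rank, item in enumerate(ranked_results[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, k=10):
    """Average the reciprocal rank over (ranked_results, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)


# Toy data (hypothetical): three queries with their ranked API suggestions.
runs = [
    (["a", "b", "c"], {"b"}),  # first hit at rank 2 -> 0.5
    (["x", "y"], {"x"}),       # first hit at rank 1 -> 1.0
    (["p", "q"], {"z"}),       # no hit in the top k -> 0.0
]
mrr = mean_reciprocal_rank(runs)
```

In this toy case the three reciprocal ranks are 0.5, 1.0, and 0.0, so the mean is 0.5.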
Tool 3 (Requirements Change Impact Analysis)
Query     SE domain (Poisson)   SE domain (TF-IDF)
Query 2   0.588                 0.588
Query 4   0.981                 0.981
Query 5   0.602                 0.602
Citation
If you make use of this work, please cite:
@inproceedings{fan2023stop,
title={Stop Words for Processing Software Engineering Documents: Do they Matter?},
author={Yaohou Fan and Chetan Arora and Christoph Treude},
booktitle={2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
year={2023},
organization={IEEE}
}
Created:
2023-04-26



