
SE Stopwords

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/7865747
Resource description:
Overview

This repository contains stopword lists specifically tailored for natural language processing (NLP) tasks applied to software development documents. It aims to enhance the efficiency and accuracy of NLP applications on various types of software documentation, including bug reports, commit messages, and API documentation.

Background and Motivation

Stop words, deemed non-predictive, are often eliminated in NLP tasks. However, the definition of uninformative vocabulary remains vague, leading most algorithms to use general knowledge-based stop lists. The effectiveness of stop word elimination, particularly in domain-specific settings, is debated among academics.

In a recent paper, we investigated the usefulness of stop word removal in a software engineering context. To achieve this, we replicated and experimented with three software engineering research tools from related work. A corpus of software engineering domain-related text was constructed from 10,000 Stack Overflow questions, and 200 domain-specific stop words were identified using traditional information-theoretic methods. The results demonstrated that using domain-specific stop words significantly improved the performance of research tools compared to a general stop list. Moreover, 17 out of 19 evaluation measures showed better performance.

Comparison to Baseline across 19 Metrics

The table below summarizes the performance improvements when using different stopword lists compared to the baseline across 19 metrics.

| Stop word list | Better | Worse | Same |
|---|---|---|---|
| SE Domain (TF-IDF) | 17 | 1 | 1 |
| SE Domain (Poisson) | 12 | 5 | 2 |
| Technology Domain | 9 | 9 | 1 |
| Large | 11 | 8 | 0 |
| Medium | 11 | 7 | 1 |
| Small | 13 | 5 | 1 |
| Very Small | 10 | 7 | 2 |
| No Stop Words | 4 | 12 | 3 |

Usage Instructions

These stopword lists can be used to filter out uninformative words from software development documents, thereby improving the understanding and analysis of textual data in the software development domain. To use these lists in your NLP tasks, simply import them into your project and apply them as filters during the pre-processing stage.

Folder Structure

    SE-stopwords
    |-- data_for_replications (contains all the required data for replicating software engineering tools)
    |   |-- Maalej_Dataset (original data for app review tool)
    |   `-- queries (queries used for RACKTool)
    |-- stackoverflow_questions (more than 10k top reviewed questions on stackoverflow)
    |-- stopwords_lists (all the stoplists)
    |-- replications
    `-- stackoverflow (code for creating the domain-specific corpus)

Detailed Results for the Three Replicated Tools

The results may vary by a small fraction depending on the trial, but they should be approximately the same as the tables below.

Tool 1 (App Review)

Column groups: PD (bug report), RT (rating), FR (feature request), UE (user experience). Pre = precision, Rec = recall.

| Stop word list | PD Pre | PD Rec | PD F1 | RT Pre | RT Rec | RT F1 | FR Pre | FR Rec | FR F1 | UE Pre | UE Rec | UE F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE domain (Poisson) | 10.0% | 37.5% | 15.8% | 72.1% | 78.0% | 74.9% | 7.1% | 29.8% | 11.5% | 11.6% | 32.0% | 17.0% |
| SE domain (TF-IDF) | 10.7% | 40.2% | 16.9% | 72.2% | 78.2% | 75.1% | 7.9% | 33.3% | 12.8% | 11.7% | 32.5% | 17.2% |

Tool 2 (RACK)

| Stop word list | Top-10 | MRR@10 | MAP@10 | MR@K |
|---|---|---|---|---|
| SE domain (Poisson) | 83.85% | 52.29% | 43.27% | 54.47% |
| SE domain (TF-IDF) | 84.17% | 53.20% | 45.82% | 56.8% |

Tool 3 (Requirements Change Impact Analysis)

| Query | SE domain (Poisson) | SE domain (TF-IDF) |
|---|---|---|
| Query 2 | 0.588 | 0.588 |
| Query 4 | 0.981 | 0.981 |
| Query 5 | 0.602 | 0.602 |

Citation

If you make use of this work, please cite:

    @inproceedings{fan2023stop,
      title={Stop Words for Processing Software Engineering Documents: Do they Matter?},
      author={Yaohou Fan and Chetan Arora and Christoph Treude},
      booktitle={2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
      year={2023},
      organization={IEEE}
    }
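The usage instructions above (import a stop list and apply it as a filter during pre-processing) could look roughly like the following sketch. It assumes each stop list is a plain-text file with one word per line, which the description does not confirm. The function names, the simple letters-only tokenizer, and the sample words are all hypothetical and are not taken from the repository.

```python
import re
from pathlib import Path


def parse_stopwords(text: str) -> set[str]:
    """Parse a stop list, ASSUMING one word per line (layout not confirmed by the source)."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_stopwords(path: str) -> set[str]:
    """Read a stop list file from disk; the path is supplied by the caller."""
    return parse_stopwords(Path(path).read_text(encoding="utf-8"))


def tokenize(text: str) -> list[str]:
    """Very simple tokenizer: lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(text: str, stopwords: set[str]) -> list[str]:
    """Return the tokens of `text` that are not in the stop list."""
    return [tok for tok in tokenize(text) if tok not in stopwords]


# Hypothetical sample list for illustration only, not the published lists.
SAMPLE_STOPLIST = "the\nis\nwhen\ni\nto\ncode\nerror\n"
stopwords = parse_stopwords(SAMPLE_STOPLIST)

filtered = remove_stopwords(
    "The code throws an error when I try to parse JSON", stopwords
)
print(filtered)
```

In a real project, `load_stopwords` would be pointed at one of the files under `stopwords_lists`, and the tokenizer would likely be replaced by whatever the surrounding NLP pipeline already uses.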
Created:
2023-04-26