GitHub Pull Request Analysis: Sentiment Data and Developer Survey Responses

Source: NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://zenodo.org/record/8271703

Description:

The first dataset, the PR Comments Dataset, was curated for a specialized Reinforcement Learning formalization of Pull-Request (PR) outcome prediction on GitHub using only the developer discussions. It contains more than 588,097 in-line code comments (a little under 0.6 million) from 66,281 PRs, with a total of 15 features. The raw comments and their respective commit_ids were extracted from the work published by Akshay Sinha (refer to the references). The data spans January 2015 to December 2020. All other features were augmented using the GitHub REST API.

Feature extraction added the following features:
has_code_element: whether the comment makes a code suggestion or not
word_count: number of words in the comment (British and American English only, based on the Hunspell library)
stopw_ratio: ratio of the number of stop words to the total word count of the comment

Sentiment analysis conducted with VADER added:
neg_vr: negative polarity score
neu_vr: neutral polarity score
pos_vr: positive polarity score
compound: overall polarity score of the comment

Other PR- and project-related features include:
owner_name: the account owner of the repo (not case sensitive)
repo_name: the name of the repo without the .git extension (not case sensitive)
pull_no: the number that identifies the PR
merged_or_not: whether the PR has been merged or not
timestamp: the timestamp of each comment

To view individual PRs (and consequently their related comments), group by owner_name, repo_name, pull_no.

The second dataset is a collection of responses to an online exploratory survey targeting software developers and engineers. Its underlying objective was to explore developers' perspectives on PR review processes and the quality of these reviews. We received a total of 22 responses.
We designed a survey protocol following #### University's guidelines for online research, adhering to the #### (####) in ####. After careful evaluation by #### University's Research Ethics Boards, in alignment with TCPS2, we received approval on May 2, 2023 (Ethics Clearance ID #####), effective until May 31, 2023.

The survey was structured into three sections and was designed to safeguard participant anonymity. The first section covered the participant's demographic and professional background, with six primary questions and an optional seventh. The second section focused on PR factors and review practices, with two multiple-choice questions and two Likert-scale questions for structured feedback. The third section prompted more detailed insights through two open-ended questions, in which respondents could further describe their PR review experiences and techniques.
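For the PR Comments Dataset, the suggested grouping by owner_name, repo_name, pull_no can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the sample rows are invented, the helper names group_by_pr and mean_compound are hypothetical, and the actual file format and value encodings of the dataset are not specified in this description. Because owner_name and repo_name are described as not case sensitive, the sketch lowercases them before building the key.

```python
from collections import defaultdict

# Invented sample rows for illustration only; they are not taken from the dataset.
# Column names follow the feature list in the description.
rows = [
    {"owner_name": "Octo", "repo_name": "Demo", "pull_no": 7, "compound": 0.5},
    {"owner_name": "octo", "repo_name": "demo", "pull_no": 7, "compound": -0.1},
    {"owner_name": "octo", "repo_name": "demo", "pull_no": 9, "compound": 0.0},
]


def group_by_pr(rows):
    """Group comment rows by (owner_name, repo_name, pull_no).

    owner_name and repo_name are lowercased because the description
    states that they are not case sensitive.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (
            row["owner_name"].lower(),
            row["repo_name"].lower(),
            int(row["pull_no"]),
        )
        groups[key].append(row)
    return dict(groups)


def mean_compound(comments):
    """Average the VADER compound score over the comments of one PR."""
    return sum(float(c["compound"]) for c in comments) / len(comments)


prs = group_by_pr(rows)
for key, comments in prs.items():
    # One line per PR: key, number of comments, mean compound score.
    print(key, len(comments), round(mean_compound(comments), 3))
```

Normalizing case before grouping keeps comments such as those under "Octo/Demo" and "octo/demo" in the same PR; whether the published data already uses a consistent case is not stated, so the normalization is a precaution rather than a documented requirement.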

Created:

2023-10-31
