Data from: Can we make it better? Assessing and improving quality of GitHub repositories

Name: Data from: Can we make it better? Assessing and improving quality of GitHub repositories
Creator: SMU Research Data Repository (RDR)
Published: 2022-03-07 04:06:22
License: No description available

Source: DataCite Commons. Updated: 2022-03-07. Indexed: 2024-07-13.

Download link:

https://researchdata.smu.edu.sg/articles/dataset/Data_from_Can_we_make_it_better_Assessing_and_improving_quality_of_GitHub_repositories/17073050

Description:

This is the related dataset for the PhD dissertation by G. A. A. Prana, "Can We Make It Better? Assessing and Improving Quality of GitHub Repositories", available at https://ink.library.smu.edu.sg/etd_coll/373/

The code hosting platform GitHub has gained immense popularity worldwide in recent years, with over 200 million repositories hosted as of June 2021. Due to its popularity, it has great potential to facilitate widespread improvements across many software projects. Naturally, GitHub has attracted much research attention, and the source code in the various repositories it hosts also provides an opportunity to apply techniques and tools developed by software engineering researchers over the years. However, much of the existing body of research applicable to GitHub focuses on the code quality of software projects and ways to improve it. Fewer works focus on ways to improve the quality of GitHub repositories through other aspects, although the quality of a software project on GitHub is also affected by factors outside the project's source code, such as documentation, the project's dependencies, and its pool of contributors.

The three works that form this dissertation investigate aspects of GitHub repositories beyond code quality, and identify specific potential improvements that can be applied to a wide range of GitHub repositories. In the first work, we aim to systematically understand the content of README files in GitHub software projects, and develop a tool that can process them automatically. The work begins with a qualitative study involving 4,226 README file sections from 393 randomly sampled GitHub repositories, which reveals that many README files contain the "What" and "How" of the software project, but often do not contain the purpose and status of the project. This is followed by the development and evaluation of a multi-label classifier that can predict eight different README content categories with an F1 of 0.746. From our subsequent evaluation of the classifier, which involves twenty software professionals, we find that adding labels generated by the classifier to README files eases information discovery.

Our second work focuses on characteristics of vulnerabilities in open-source libraries used by 450 software projects on GitHub that are written in Java, Python, and Ruby. Using an industrial software composition analysis tool, we scanned every version of the projects after each commit made between November 1, 2017 and October 31, 2018. Our subsequent analyses of the discovered library names, versions, and associated vulnerabilities reveal, among other findings, that the "Denial of Service" and "Information Disclosure" vulnerability types are common. In addition, we find that most of the vulnerabilities persist throughout the observation period, and that attributes such as project size, project popularity, and experience level of commit authors do not translate to better or worse handling of vulnerabilities in dependent libraries. Based on the findings in the second work, we list a number of implications for library users, library developers, and researchers, and provide several concrete recommendations. These include recommendations to simplify projects' dependency sets, as well as to encourage research into ways to automatically recommend libraries known to be secure to developers.

In our third work, we conduct a multi-region geographical analysis of gender inclusion on GitHub. We use a mixed-methods approach involving a quantitative analysis of commit authors of 21,456 project repositories, followed by a survey strategically targeted at developers in various regions worldwide and a qualitative analysis of the survey responses. Among other findings, we discover differences in diversity levels between regions, with Asia and the Americas being highest. We also find no strong correlation between gender and geographic diversity of a repository's commit authors. Further, from our survey respondents worldwide, we identify barriers and motivations to contribute to open-source software. The results of this work provide insights into the current state of gender diversity in open-source software and potential ways to improve participation of developers from under-represented regions and genders, and subsequently improve the open-source software community in general. Such potential ways include creation of codes of conduct, proximity-based mentorship schemes, and highlighting of women and regional role models.


Provided by:

SMU Research Data Repository (RDR)

Created:

2021-12-03
