88.6 Million Developer Comments from GitHub

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/5596536


Resource description:

Description

This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:

  54,252,380 issue comments (from 13,458,208 issues)
  979,642 commit comments (from 49,710,108 commits)
  33,408,215 pull request comments (from 12,680,373 pull requests)

Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.

Purpose

The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.

Collection Process

Full implementation details can be found in the following publication:

  Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.

Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:

  At least 850 stars.
  Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.

Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecoderErrors, etc.), which leaves the 17,378 repositories included here. Eight target languages had no repositories matching the above criteria.

After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files, because newlines have not been removed from developer comments and a single comment can therefore span several physical lines.

88_million_developer_comments.zip

This zip file contains 135 CSV files; 3 per language. CSV names are formatted <language>_<type>.csv, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests). Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL. Details on the columns in each CSV file are described in the provided README.md.

Detailed_Breakdown.ods

This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.

Note On Completeness

We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, some queries failed, and the data they would have returned is not included here.

Versioning

  v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.
  v1.0: Original corpus.

Contact

Please contact Benjamin S. Meyers (email) with questions about this data and its collection.

Acknowledgments

Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018). This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology.

doi:10.34788/0S3G-QD15
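
The recommendation above to parse the files with csv.reader can be sketched as follows. This is a minimal illustration, not the authors' tooling. It assumes that file names consist of a language name, an underscore, and one of co, is, or pr, followed by .csv, and that the files are UTF-8 encoded (the encoding is an assumption; the description does not state it). The helper names parse_csv_name and iter_rows and the KIND_LABELS mapping are hypothetical. The column layout is documented in the dataset's README.md and is not reproduced here.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical mapping from the two-letter suffix to a readable label.
KIND_LABELS = {"co": "commits", "is": "issues", "pr": "pull requests"}


def parse_csv_name(member_name):
    """Split a name such as 'Python_is.csv' into (language, kind).

    Returns None when the name does not follow the assumed pattern.
    """
    path = PurePosixPath(member_name)
    if path.suffix.lower() != ".csv":
        return None
    # Split on the last underscore so the language part may contain hyphens.
    language, sep, kind = path.stem.rpartition("_")
    if not sep or not language or kind not in KIND_LABELS:
        return None
    return language, kind


def iter_rows(zip_path, member_name, encoding="utf-8"):
    """Yield one list of fields per CSV record from a file inside the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            # newline="" leaves line endings untouched so that csv.reader can
            # keep newlines that sit inside quoted comment fields.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            yield from csv.reader(text)
```

Reading through csv.reader, rather than splitting the file on line breaks, keeps a multi-line comment together as one record. Very long comments may exceed the csv module's default field size limit; if that happens, csv.field_size_limit can be used to raise it.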

Created:

2024-01-04
