Pull Request Review Comments Dataset
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4773067
Resource description:
Pull Request Review Comments (PRRC) Datasets
Two datasets have been created from data published on the gharchive website. The Pull Request Review Comment Event was selected from the set of available GitHub events. The datasets were created for CARA (Chatbot for Automating Repairnator Actions) as part of a master's thesis at KTH, Stockholm.
First, a source dataset was downloaded from gharchive. It covers January 2015 to December 2020, consists of 54,021,838 PRRCs, and is over 18 gigabytes in size. It took over 120 hours to download all the data files and extract the PRRCs from them. From this source dataset, two subsets were derived:
Pull Request Review Comments Dataset: This dataset contains the comments from the latest 100,000 threads in the source dataset from gharchive.
Pull Request Review Threads Dataset: This is the dataset of comments that were concatenated together if they were from the same thread (in chronological order).
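The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes the hourly gharchive files are gzip-compressed JSONLines, that PRRC events carry the type `PullRequestReviewCommentEvent`, and that the dataset fields map onto the GitHub event payload as shown. The mapping, in particular which payload key feeds `url` and `author`, is a guess, and the function names are hypothetical.

```python
import gzip
import json
from typing import Iterable, Iterator

# GitHub event type for pull request review comments.
PRRC_EVENT = "PullRequestReviewCommentEvent"


def extract_prrcs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat record per PRRC event found in JSONLines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != PRRC_EVENT:
            continue
        payload = event.get("payload") or {}
        comment = payload.get("comment") or {}
        pull_request = payload.get("pull_request") or {}
        yield {
            "comment_id": comment.get("id"),
            "commit_id": comment.get("commit_id"),
            # Assumption: the browser-facing link is the one stored in `url`.
            "url": comment.get("html_url"),
            # Assumption: the pull request author, following the field
            # description in this document.
            "author": (pull_request.get("user") or {}).get("login"),
            "created_at": comment.get("created_at"),
            "body": comment.get("body"),
        }


def extract_from_archive(path: str) -> Iterator[dict]:
    """Read one downloaded hourly archive (path is supplied by the caller)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from extract_prrcs(handle)
```

Keeping the parsing in `extract_prrcs` separate from file handling makes it possible to check the field mapping on a handful of in-memory lines before running it over the full archive.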
Description
The dataset is stored in the JSONLines format, as was the source dataset from gharchive.
For PRRC events, the source dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`.
`comment_id` is the field which specifies the ID GitHub uses for that comment.
`commit_id` is the field which specifies the ID of the commit proposed in the pull request.
`url` is the field which specifies the url to the comment in a pull request thread.
`author` is the field which lists the username of the author of the pull request.
`created_at` is the field which specifies the time at which the pull request comment was created.
`body` is the field which describes the contents of the PRRC.
The threads dataset contains only the fields `url` and `body`. The `url` field has the same meaning as above, but the `body` field differs: it is a concatenation of all the PRRCs in a pull request thread. The comments dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`, which are the same fields as in the source dataset.
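To make the relationship between the two datasets concrete, here is a small sketch of how comment records could be grouped into thread records. The description does not say how a thread is identified or which separator joins the comments, so `default_thread_key` (the comment URL without its fragment) and the newline separator are assumptions for illustration only; both can be swapped out.

```python
import json
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List


def read_jsonl(path: str) -> Iterator[dict]:
    """Read a JSONLines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def default_thread_key(comment: dict) -> str:
    # Hypothetical thread identifier: the comment URL minus its fragment.
    return comment["url"].split("#", 1)[0]


def build_threads(
    comments: Iterable[dict],
    thread_key: Callable[[dict], str] = default_thread_key,
    separator: str = "\n",  # assumption: the real separator is not documented
) -> List[dict]:
    """Concatenate comment bodies per thread in chronological order."""
    groups = defaultdict(list)
    for comment in comments:
        groups[thread_key(comment)].append(comment)

    threads = []
    for key, members in groups.items():
        # ISO 8601 timestamps in the same format sort correctly as strings.
        members.sort(key=lambda c: c["created_at"])
        body = separator.join(c["body"] for c in members)
        threads.append({"url": key, "body": body})
    return threads
```

For example, `build_threads(read_jsonl(path))` would turn a comments file into a list of `{"url", "body"}` records, where `path` points to a local copy of the comments dataset.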
Construction
We used the fastText language identification model published by Facebook to detect the language of each PRRC. Only PRRCs written in English were kept. We also removed any PRRC or thread whose size exceeded 128 kilobytes.
Created:
2021-06-23
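The two filters can be sketched as below. This is an illustration under stated assumptions rather than the original pipeline: the size is measured on the UTF-8 encoded `body`, a kilobyte is taken as 1024 bytes, and the language detector is passed in as a plain function so the filter can be tried without a model. `make_fasttext_detector` shows one way to wrap the `fasttext` Python package; which model file was actually used is not stated, so `model_path` is left to the caller.

```python
from typing import Callable, Iterable, Iterator

# Assumption: 128 kilobytes means 128 * 1024 bytes of UTF-8 encoded text.
MAX_BYTES = 128 * 1024


def make_fasttext_detector(model_path: str) -> Callable[[str], str]:
    """Wrap a fastText language identification model as text -> language code."""
    import fasttext  # imported lazily so the filter works without the package

    model = fasttext.load_model(model_path)

    def detect(text: str) -> str:
        # fastText predicts one line at a time, so newlines are flattened.
        labels, _ = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", "")

    return detect


def keep_english(
    records: Iterable[dict],
    detect: Callable[[str], str],
    max_bytes: int = MAX_BYTES,
) -> Iterator[dict]:
    """Yield records whose body fits the size limit and is detected as English."""
    for record in records:
        body = record.get("body") or ""
        if len(body.encode("utf-8")) > max_bytes:
            continue  # drop oversized comments or threads
        if detect(body) != "en":
            continue  # drop non-English text
        yield record
```

Checking the size before calling the detector avoids running the model on bodies that would be discarded anyway.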



