A Dataset of Bot and Human Activities in GitHub
Indexed by NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/7740520
Resource description:
This repository provides an updated version of a dataset of GitHub contributor activities that accompanies a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work was done as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.
The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from the GitHub Events API and cover 24 distinct activity types. This dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could lead to better bot identification tools and to empirical studies on the role bots play in collaborative software development.
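As a rough illustration of what analysing activity sequences could look like, the sketch below groups activity records by contributor and orders them by date. It assumes records with `date`, `activity` and `contributor` keys, as in the example given later in this description, and it assumes that the `contributor` value distinguishes contributors from one another. `activity_sequences` is a hypothetical helper, not part of the dataset tooling, and the second sample record is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def activity_sequences(activities):
    """Return {contributor: [activity type, ...]} ordered by activity date."""
    grouped = defaultdict(list)
    for record in activities:
        grouped[record["contributor"]].append(record)

    sequences = {}
    for contributor, records in grouped.items():
        # Dates are ISO 8601 strings with a UTC offset,
        # e.g. "2022-11-25T18:49:09+00:00".
        records.sort(key=lambda r: datetime.fromisoformat(r["date"]))
        sequences[contributor] = [r["activity"] for r in records]
    return sequences


# Illustrative records only; the second one is not taken from the dataset.
sample = [
    {"date": "2022-11-25T18:49:09+00:00",
     "activity": "Closing pull request",
     "contributor": "typescript-bot"},
    {"date": "2022-11-25T09:55:19+00:00",
     "activity": "Commenting pull request",
     "contributor": "typescript-bot"},
]
print(activity_sequences(sample))
```

Sorting on parsed timestamps rather than on the raw strings keeps the ordering correct even if records were to carry different UTC offsets.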
Files description
The following files are provided as part of the archive:
bot_activities.json - A JSON file containing 754,165 activities made by 350 bot contributors;
human_activities.json - A JSON file containing 261,258 activities made by 620 human contributors (anonymized);
JsonSchema.json - A JSON schema that validates the above datasets;
bots.txt - A text file containing the login names of all 350 bots.
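A minimal loading sketch for these files follows. It assumes that each activity file holds a single JSON array of activity objects and that bots.txt lists one login name per line; neither layout is stated above, so check both against the actual archive. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_activities(path):
    """Load one activity file (assumed to be a single JSON array of objects)."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def load_bot_logins(path):
    """Read bot login names (assumed to be listed one per line)."""
    with Path(path).open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def count_by_activity(activities):
    """Count how many records there are for each activity type."""
    return Counter(record["activity"] for record in activities)
```

A call such as `count_by_activity(load_activities("bot_activities.json")).most_common(5)` would then list the five most frequent activity types. Note that `json.load` reads the whole file into memory, which may be heavy for files of this size.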
Example
Below is an example of a Closing pull request activity:
{
  "date": "2022-11-25T18:49:09+00:00",
  "activity": "Closing pull request",
  "contributor": "typescript-bot",
  "repository": "DefinitelyTyped/DefinitelyTyped",
  "comment": {
    "length": 249,
    "GH_node": "IC_kwDOAFz6BM5PJG7l"
  },
  "pull_request": {
    "id": 62328,
    "title": "[qunit] Add `test.each()`",
    "created_at": "2022-09-19T17:34:28+00:00",
    "status": "closed",
    "closed_at": "2022-11-25T18:49:08+00:00",
    "merged": false,
    "GH_node": "PR_kwDOAFz6BM4_N5ib"
  },
  "conversation": {
    "comments": 19
  },
  "payload": {
    "pr_commits": 1,
    "pr_changed_files": 5
  }
}
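To show how the nested fields of such a record can be used, the sketch below computes how long the pull request in the example stayed open. It copies only the fields it needs from the example above; `pull_request_lifetime` is a hypothetical helper.

```python
from datetime import datetime

# Subset of the example activity shown above.
closing_activity = {
    "date": "2022-11-25T18:49:09+00:00",
    "activity": "Closing pull request",
    "pull_request": {
        "created_at": "2022-09-19T17:34:28+00:00",
        "closed_at": "2022-11-25T18:49:08+00:00",
        "merged": False,
    },
}


def pull_request_lifetime(activity):
    """Return closed_at - created_at as a timedelta, or None if still open."""
    pull_request = activity["pull_request"]
    if pull_request["closed_at"] is None:
        return None
    created = datetime.fromisoformat(pull_request["created_at"])
    closed = datetime.fromisoformat(pull_request["closed_at"])
    return closed - created


lifetime = pull_request_lifetime(closing_activity)
print(lifetime.days)  # 67
```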
List of activity types
In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting branch, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.
List of fields
Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.
For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe the fields that are provided in the JSON file for each activity type. It is worth mentioning that we also provide the corresponding JSON schema alongside the datasets.
Properties
date
  Date on which the activity is performed
  Type: string
  e.g., "2022-11-25T09:55:19+00:00"
  String format must be a "date-time"
activity
  The activity performed by the contributor
  Type: string
  e.g., "Commenting pull request"
contributor
  The login name of the contributor who performed this activity
  Type: string
  e.g., "analysis-bot", "anonymised" in the case of a human contributor
repository
  The repository in which the activity is performed
  Type: string
  e.g., "apache/spark", "anonymised" in the case of a human contributor
issue
  Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
  Type: object
  Properties
    id
      Issue number
      Type: integer
      e.g., 35471
    title
      Issue title
      Type: string
      e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
    created_at
      The date on which this issue is created
      Type: string
      e.g., "2022-11-10T13:07:23+00:00"
      String format must be a "date-time"
    status
      Current state of the issue
      Type: string
      "open" or "closed"
    closed_at
      The date on which this issue is closed. "null" will be provided if the issue is open
      Types: string, null
      e.g., "2022-11-25T10:42:39+00:00"
      String format must be a "date-time"
    resolved
      Whether the issue is resolved (true) or not planned / still open (false)
      Type: boolean
      true or false
    GH_node
      The GitHub node of this issue
      Type: string
      e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
pull_request
  Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
  Type: object
  Properties
    id
      Pull request number
      Type: integer
      e.g., 35471
    title
      Pull request title
      Type: string
      e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
    created_at
      The date on which this pull request is created
      Type: string
      e.g., "2022-11-10T13:07:23+00:00"
      String format must be a "date-time"
    status
      Current state of the pull request
      Type: string
      "open" or "closed"
    closed_at
      The date on which this pull request is closed. "null" will be provided if the pull request is open
      Types: string, null
      e.g., "2022-11-25T10:42:39+00:00"
      String format must be a "date-time"
    merged
      Whether the pull request is merged (true) or rejected / still open (false)
      Type: boolean
      true or false
    GH_node
      The GitHub node of this pull request
      Type: string
      e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
review
  Pull request review information - provided for Reviewing code
  Type: object
  Properties
    status
      Status of the review
      Type: string
      "changes_requested" or "approved" or "dismissed"
    GH_node
      The GitHub node of this review
      Type: string
      e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
conversation
  Comments information in issue or pull request - provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
  Type: object
  Properties
    comments
      Number of comments present in the corresponding issue or pull request
      Type: integer
      e.g., 5
comment
  Comment information - provided for all the activities for which the field issue or pull_request is reported, and additionally for Commenting commits
  Type: object
  Properties
    length
      Length of the comment text (or description text if comment is not provided)
      Type: integer
      e.g., 25
    GH_node
      The GitHub node of this comment or description. "null" will be provided if no comment is expected
      Types: string, null
      e.g., "IC_kwDOEj6V8c5PHT78", "anonymised" in the case of a human contributor
gitref
  Git reference information - provided for Creating branch, Creating tag, Deleting branch, Deleting tag, Editing wiki page and Publishing a release
  Type: object
  Properties
    type
      Type of the gitref
      Type: string
      "tag" or "branch" or "commit"
    name
      Name of the gitref
      Type: string
      e.g., "cherry-pick-11-to-release-4.10"
    description_length
      Length of the description text provided while creating the gitref. "null" will be provided if the type is "branch" or "commit", as they do not have any description
      Types: integer, null
      e.g., 23
release
  Release information - provided for Publishing a release
  Type: object
  Properties
    name
      The name of the release that is created. "null" will be provided if the name is not provided
      Types: string, null
      e.g., "v0.65.9"
    description_length
      Length of the description of the release that is created
      Type: integer
      e.g., 888
    created_at
      The date on which the release is created (the activity date is the date on which the release is published)
      Type: string
      e.g., "2022-11-25T11:34:48+00:00"
      String format must be a "date-time"
    prerelease
      Whether the release that is created is a prerelease
      Type: boolean
      true or false
    new_tag
      Whether a new tag is created for this release (true) or an existing tag is re-used (false)
      Type: boolean
      true or false
    GH_node
      The corresponding release node ID
      Type: string
      e.g., "RE_kwDOCm6M2s4FBGxT", "anonymised" in the case of a human contributor
page
  Page information - provided for Editing wiki page
  Type: object
  Properties
    name
      Name of the page
      Type: string
      e.g., "Workflow-status"
    title
      Title of the page
      Type: string
      e.g., "Workflow status"
    new
      Whether the page is newly created (true) or an existing page is edited (false)
      Type: boolean
      true or false
payload
  Other additional details - provided for Opening pull request, Closing pull request, Reopening pull request and Pushing commits
  Type: object
  Properties
    pr_commits
      The number of commits in this pull request
      Type: integer
      e.g., 3
    pr_changed_files
      The number of files that are changed in this pull request
      Type: integer
      e.g., 2
    pushed_commits
      The number of commits present in this push
      Type: integer
      e.g., 4
    distinct_pushed_commits
      The number of distinct commits present in this push
      Type: integer
      e.g., 1
    github_push_id
      The corresponding GitHub push ID
      Type: integer
      e.g., 11790446870, "anonymised" in the case of a human contributor
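The "provided for" statements in the field descriptions can be turned into a small lookup that tells which top-level fields to expect for a given activity type. The sketch below only mirrors the prose of this section; the provided JsonSchema.json remains the authoritative definition, and it may treat some of these fields as optional. All names in the code are hypothetical.

```python
COMMON_FIELDS = {"date", "activity", "contributor", "repository"}

ISSUE_ACTIVITIES = {
    "Opening issue", "Closing issue", "Reopening issue",
    "Transferring issue", "Commenting issue",
}
PULL_REQUEST_ACTIVITIES = {
    "Opening pull request", "Closing pull request", "Reopening pull request",
    "Commenting pull request changes", "Reviewing code",
}

# Field name -> activity types for which the text says the field is provided.
FIELD_ACTIVITIES = {
    "issue": ISSUE_ACTIVITIES,
    "pull_request": PULL_REQUEST_ACTIVITIES,
    "review": {"Reviewing code"},
    "conversation": ISSUE_ACTIVITIES | {
        "Opening pull request", "Closing pull request",
        "Reopening pull request", "Commenting pull request",
    },
    # "comment" follows issue and pull_request, plus commit comments.
    "comment": ISSUE_ACTIVITIES | PULL_REQUEST_ACTIVITIES | {"Commenting commits"},
    "gitref": {
        "Creating branch", "Creating tag", "Deleting branch", "Deleting tag",
        "Editing wiki page", "Publishing a release",
    },
    "release": {"Publishing a release"},
    "page": {"Editing wiki page"},
    "payload": {
        "Opening pull request", "Closing pull request",
        "Reopening pull request", "Pushing commits",
    },
}


def expected_fields(activity_type):
    """Return the top-level fields the descriptions list for this activity type."""
    extra = {name for name, types in FIELD_ACTIVITIES.items() if activity_type in types}
    return COMMON_FIELDS | extra


def field_differences(record):
    """Compare a record's top-level keys with the fields listed for its type."""
    expected = expected_fields(record["activity"])
    present = set(record)
    return {"missing": expected - present, "unexpected": present - expected}
```

For the Closing pull request example given earlier, `expected_fields` yields the four common fields plus `pull_request`, `conversation`, `comment` and `payload`, which matches the keys of that record.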
Mapping between activities and events
For many activity types, an activity can be identified from the occurrence of a single event type. For example, the activity types Forking repository and Starring repository each correspond to a single event type, as given below.
Activity type       | Event type | Payload
Forking repository  | ForkEvent  | -
Starring repository | WatchEvent | action = "started"
However, in some cases, the same event type yields different activity types depending on the value present in the payload. For example, three different activity types can be generated from the same low-level event type CreateEvent, depending on the value of its ref_type (either "repository", "branch", or "tag") present in the payload.
Activity type       | Event type  | Payload
Creating repository | CreateEvent | ref_type = "repository"
Creating branch     | CreateEvent | ref_type = "branch"
Creating tag        | CreateEvent | ref_type = "tag"
In some cases, there is no one-to-one mapping between events and activities. This is because some actions on GitHub may generate more than a single event and lead to a sequence of one mandatory event and a second optional event (marked with ?). For example, for the activity type Publishing a release, event type ReleaseEvent is mandatory with payload's action value = "published", while event type CreateEvent is optional as it is required only when a new tag is created along with the published release.
Activity type        | Event type    | Payload
Publishing a release | ReleaseEvent  | action = "published"
                     | ? CreateEvent | ref_type = "tag"
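One way to picture the mandatory-plus-optional pattern is sketched below. It is not the authors' implementation: it assumes events shaped like GitHub Events API objects (`type`, `repo.name`, `payload`) and it pairs a ReleaseEvent with a CreateEvent simply by matching repository and tag name, which is a guess at a reasonable pairing rule. `publishing_release_activities` is a hypothetical helper.

```python
def publishing_release_activities(events):
    """Build 'Publishing a release' records from a list of event dicts."""
    # First pass: remember which (repository, tag) pairs were created.
    created_tags = {
        (event.get("repo", {}).get("name"), event["payload"].get("ref"))
        for event in events
        if event.get("type") == "CreateEvent"
        and event.get("payload", {}).get("ref_type") == "tag"
    }

    activities = []
    # Second pass: the ReleaseEvent is mandatory, the CreateEvent is optional.
    for event in events:
        payload = event.get("payload", {})
        if event.get("type") != "ReleaseEvent" or payload.get("action") != "published":
            continue
        repository = event.get("repo", {}).get("name")
        tag_name = payload.get("release", {}).get("tag_name")
        activities.append({
            "activity": "Publishing a release",
            "repository": repository,
            # True only when a matching tag-creation event was observed.
            "new_tag": (repository, tag_name) in created_tags,
        })
    return activities


# Illustrative events only; repository and tag names are made up.
events = [
    {"type": "CreateEvent", "repo": {"name": "owner/project"},
     "payload": {"ref_type": "tag", "ref": "v1.0"}},
    {"type": "ReleaseEvent", "repo": {"name": "owner/project"},
     "payload": {"action": "published", "release": {"tag_name": "v1.0"}}},
    {"type": "ReleaseEvent", "repo": {"name": "owner/project"},
     "payload": {"action": "published", "release": {"tag_name": "v0.9"}}},
]
print(publishing_release_activities(events))
```

The two-pass structure makes the result independent of whether the CreateEvent appears before or after the ReleaseEvent in the input.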
All identified activity types, along with their event type(s) and payload information, are given in the following table.
Activity type                     | Event type                    | Payload
Creating repository               | CreateEvent                   | ref_type = "repository"
Creating branch                   | CreateEvent                   | ref_type = "branch"
Creating tag                      | CreateEvent                   | ref_type = "tag"
Deleting tag                      | DeleteEvent                   | ref_type = "tag"
Deleting branch                   | DeleteEvent                   | ref_type = "branch"
Publishing a release              | ReleaseEvent                  | action = "published"
                                  | ? CreateEvent                 | ref_type = "tag"
Making repository public          | PublicEvent                   | -
Adding collaborator to repository | MemberEvent                   | action = "added"
Forking repository                | ForkEvent                     | -
Starring repository               | WatchEvent                    | action = "started"
Editing wiki page                 | GollumEvent                   | pages-->action = "created" or "edited"
Opening issue                     | IssuesEvent                   | action = "opened"
Closing issue                     | IssuesEvent                   | action = "closed"
                                  | ? IssueCommentEvent           | action = "created"
Reopening issue                   | IssuesEvent                   | action = "reopened"
                                  | ? IssueCommentEvent           | action = "created"
Transferring issue                | IssuesEvent                   | action = "opened"
Commenting issue                  | IssueCommentEvent             | action = "created"
Opening pull request              | PullRequestEvent              | action = "opened"
Closing pull request              | PullRequestEvent              | action = "closed"
                                  | ? IssueCommentEvent           | action = "created"
Reopening pull request            | PullRequestEvent              | action = "reopened"
                                  | ? IssueCommentEvent           | action = "created"
Commenting pull request           | IssueCommentEvent             | action = "created"
Commenting pull request changes   | PullRequestReviewCommentEvent | action = "created"
                                  | ? PullRequestReviewEvent      | action = "created"
Reviewing code                    | PullRequestReviewEvent        | action = "created"
Commenting commits                | CommitCommentEvent            | action = "created"
Pushing commits                   | PushEvent                     | -
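The mandatory rows of the table can be encoded as a lookup from an event to its candidate activity types, as in the sketch below. It assumes events shaped like GitHub Events API objects (`type` and `payload`) and deliberately ignores the optional (?) events. Several keys map to more than one activity type because the table alone does not say how to tell them apart, so the function returns candidates rather than a single answer. All names are hypothetical.

```python
# Keys are (event type, payload discriminator). The discriminator is ref_type
# for CreateEvent/DeleteEvent, action for most other events, and None where
# no payload condition is checked.
MANDATORY_EVENT_TO_ACTIVITIES = {
    ("CreateEvent", "repository"): ["Creating repository"],
    ("CreateEvent", "branch"): ["Creating branch"],
    ("CreateEvent", "tag"): ["Creating tag"],
    ("DeleteEvent", "tag"): ["Deleting tag"],
    # A DeleteEvent whose ref_type is "branch" is read here as deleting a branch.
    ("DeleteEvent", "branch"): ["Deleting branch"],
    ("ReleaseEvent", "published"): ["Publishing a release"],
    ("PublicEvent", None): ["Making repository public"],
    ("MemberEvent", "added"): ["Adding collaborator to repository"],
    ("ForkEvent", None): ["Forking repository"],
    ("WatchEvent", "started"): ["Starring repository"],
    # Simplification: the page-level action ("created" or "edited") is not checked.
    ("GollumEvent", None): ["Editing wiki page"],
    ("IssuesEvent", "opened"): ["Opening issue", "Transferring issue"],
    ("IssuesEvent", "closed"): ["Closing issue"],
    ("IssuesEvent", "reopened"): ["Reopening issue"],
    ("IssueCommentEvent", "created"): ["Commenting issue", "Commenting pull request"],
    ("PullRequestEvent", "opened"): ["Opening pull request"],
    ("PullRequestEvent", "closed"): ["Closing pull request"],
    # GitHub reports a reopened pull request with action "reopened".
    ("PullRequestEvent", "reopened"): ["Reopening pull request"],
    ("PullRequestReviewCommentEvent", "created"): ["Commenting pull request changes"],
    ("PullRequestReviewEvent", "created"): ["Reviewing code"],
    ("CommitCommentEvent", "created"): ["Commenting commits"],
    ("PushEvent", None): ["Pushing commits"],
}

REF_TYPE_EVENTS = {"CreateEvent", "DeleteEvent"}
UNCONDITIONAL_EVENTS = {"PublicEvent", "ForkEvent", "GollumEvent", "PushEvent"}


def candidate_activities(event):
    """Return the activity types whose mandatory event matches this event."""
    event_type = event.get("type")
    payload = event.get("payload", {})
    if event_type in REF_TYPE_EVENTS:
        discriminator = payload.get("ref_type")
    elif event_type in UNCONDITIONAL_EVENTS:
        discriminator = None
    else:
        discriminator = payload.get("action")
    return MANDATORY_EVENT_TO_ACTIVITIES.get((event_type, discriminator), [])


print(candidate_activities({"type": "IssuesEvent", "payload": {"action": "opened"}}))
```

An IssueCommentEvent may also be the optional second event of a closing or reopening activity, so a full converter would need to look at neighbouring events before settling on one of the candidates.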
Created:
2024-01-05



