A Dataset of Bot and Human Activities in GitHub

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/7740520
Resource description:
This repository provides an updated version of a dataset of GitHub contributor activities that accompanies a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work is part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.

The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from GitHub's Event API and cover 24 distinct activity types. The dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could lead to better bot identification tools and to empirical studies on the role bots play in collaborative software development.
Files description

The following files are provided as part of the archive:

    bot_activities.json - A JSON file containing 754,165 activities made by 350 bot contributors;
    human_activities.json - A JSON file containing 261,258 activities made by 620 human contributors (anonymized);
    JsonSchema.json - A JSON schema that validates the above datasets;
    bots.txt - A text file containing the login names of all 350 bots.

Example

Below is an example of a Closing pull request activity:

    {
        "date": "2022-11-25T18:49:09+00:00",
        "activity": "Closing pull request",
        "contributor": "typescript-bot",
        "repository": "DefinitelyTyped/DefinitelyTyped",
        "comment": {
            "length": 249,
            "GH_node": "IC_kwDOAFz6BM5PJG7l"
        },
        "pull_request": {
            "id": 62328,
            "title": "[qunit] Add `test.each()`",
            "created_at": "2022-09-19T17:34:28+00:00",
            "status": "closed",
            "closed_at": "2022-11-25T18:49:08+00:00",
            "merged": false,
            "GH_node": "PR_kwDOAFz6BM4_N5ib"
        },
        "conversation": {
            "comments": 19
        },
        "payload": {
            "pr_commits": 1,
            "pr_changed_files": 5
        }
    }

List of activity types

In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting branch, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.

List of fields

Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, Commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.
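As a rough illustration of how a record such as the Closing pull request example above might be read, the sketch below parses the ISO 8601 timestamps and tallies activity types. The helper names (load_activities, pr_open_duration, count_activity_types) are hypothetical and not part of the dataset. The assumption that each JSON file holds one top-level array of activity objects is not stated in the description and should be checked against JsonSchema.json.

```python
import json
from collections import Counter
from datetime import datetime

# The Closing pull request example from the description, as a Python dict.
EXAMPLE = {
    "date": "2022-11-25T18:49:09+00:00",
    "activity": "Closing pull request",
    "contributor": "typescript-bot",
    "repository": "DefinitelyTyped/DefinitelyTyped",
    "comment": {"length": 249, "GH_node": "IC_kwDOAFz6BM5PJG7l"},
    "pull_request": {
        "id": 62328,
        "title": "[qunit] Add `test.each()`",
        "created_at": "2022-09-19T17:34:28+00:00",
        "status": "closed",
        "closed_at": "2022-11-25T18:49:08+00:00",
        "merged": False,
        "GH_node": "PR_kwDOAFz6BM4_N5ib",
    },
    "conversation": {"comments": 19},
    "payload": {"pr_commits": 1, "pr_changed_files": 5},
}


def load_activities(path):
    """Load activities from a file (ASSUMPTION: a top-level JSON array)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def pr_open_duration(activity):
    """Return closed_at - created_at for a pull request, or None.

    None is returned when the record has no pull_request field or the
    pull request is still open (closed_at is null).
    """
    pr = activity.get("pull_request")
    if not pr or pr.get("closed_at") is None:
        return None
    created = datetime.fromisoformat(pr["created_at"])
    closed = datetime.fromisoformat(pr["closed_at"])
    return closed - created


def count_activity_types(activities):
    """Count how often each activity type occurs."""
    return Counter(record["activity"] for record in activities)
```

For instance, pr_open_duration(EXAMPLE) gives the time the example pull request stayed open, and count_activity_types(load_activities("bot_activities.json")) would tally bot activities if the file layout matches the assumption above.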
For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe for each activity type the different fields that are provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the datasets.

Properties

date
    Date on which the activity is performed
    Type: string
    e.g., "2022-11-25T09:55:19+00:00"
    String format must be a "date-time"

activity
    The activity performed by the contributor
    Type: string
    e.g., "Commenting pull request"

contributor
    The login name of the contributor who performed this activity
    Type: string
    e.g., "analysis-bot", "anonymised" in the case of a human contributor

repository
    The repository in which the activity is performed
    Type: string
    e.g., "apache/spark", "anonymised" in the case of a human contributor

issue
    Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
    Type: object
    Properties
        id
            Issue number
            Type: integer
            e.g., 35471
        title
            Issue title
            Type: string
            e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        created_at
            The date on which this issue is created
            Type: string
            e.g., "2022-11-10T13:07:23+00:00"
            String format must be a "date-time"
        status
            Current state of the issue
            Type: string
            "open" or "closed"
        closed_at
            The date on which this issue is closed. "null" will be provided if the issue is open
            Types: string, null
            e.g., "2022-11-25T10:42:39+00:00"
            String format must be a "date-time"
        resolved
            The issue is resolved or not_planned/still open
            Type: boolean
            true or false
        GH_node
            The GitHub node of this issue
            Type: string
            e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor

pull_request
    Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
    Type: object
    Properties
        id
            Pull request number
            Type: integer
            e.g., 35471
        title
            Pull request title
            Type: string
            e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        created_at
            The date on which this pull request is created
            Type: string
            e.g., "2022-11-10T13:07:23+00:00"
            String format must be a "date-time"
        status
            Current state of the pull request
            Type: string
            "open" or "closed"
        closed_at
            The date on which this pull request is closed. "null" will be provided if the pull request is open
            Types: string, null
            e.g., "2022-11-25T10:42:39+00:00"
            String format must be a "date-time"
        merged
            The PR is merged or rejected/still open
            Type: boolean
            true or false
        GH_node
            The GitHub node of this pull request
            Type: string
            e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor

review
    Pull request review information - provided for Reviewing code
    Type: object
    Properties
        status
            Status of the review
            Type: string
            "changes_requested" or "approved" or "dismissed"
        GH_node
            The GitHub node of this review
            Type: string
            e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor

conversation
    Comments information in issue or pull request - provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
    Type: object
    Properties
        comments
            Number of comments present in the corresponding issue or pull request
            Type: integer
            e.g., 5

comment
    Comment information - provided for all the activities for which the field issue or pull_request is reported, and additionally for commit comments
    Type: object
    Properties
        length
            Length of the comment text (or description text if comment is not provided)
            Type: integer
            e.g., 25
        GH_node
            The GitHub node of this comment or description. "null" will be provided if there is no comment expected
            Types: string, null
            e.g., "IC_kwDOEj6V8c5PHT78", "anonymised" in the case of a human contributor

gitref
    Tag information - provided for Creating branch, Creating tag, Deleting branch, Deleting tag, Editing wiki page and Publishing a release
    Type: object
    Properties
        type
            Type of the gitref
            Type: string
            "tag" or "branch" or "commit"
        name
            Name of the gitref
            Type: string
            e.g., "cherry-pick-11-to-release-4.10"
        description_length
            Length of the description text provided while creating the gitref. "null" will be provided if the type is "branch" or "commit", as they do not have any description
            Types: integer, null
            e.g., 23

release
    Release information - provided for Publishing a release
    Type: object
    Properties
        name
            The name of the release that is created. "null" will be provided if the name is not provided
            Types: string, null
            e.g., "v0.65.9"
        description_length
            Length of the description of the release that is created
            Type: integer
            e.g., 888
        created_at
            The date at which the release is created (activity date is the release published date)
            Type: string
            e.g., "2022-11-25T11:34:48+00:00"
            String format must be a "date-time"
        prerelease
            Whether the release that is created is a prerelease
            Type: boolean
            true or false
        new_tag
            Whether a new tag is created for this release or another tag is re-used
            Type: boolean
            true or false
        GH_node
            The corresponding release node ID
            Type: string
            e.g., "RE_kwDOCm6M2s4FBGxT", "anonymised" in the case of a human contributor

page
    Page information - provided for Editing wiki page
    Type: object
    Properties
        name
            Name of the page
            Type: string
            e.g., "Workflow-status"
        title
            Title of the page
            Type: string
            e.g., "Workflow status"
        new
            Whether the page is newly created or an existing page is edited
            Type: boolean
            true or false

payload
    Other additional details - provided for Opening pull request, Closing pull request, Reopening pull request and Pushing commits
    Type: object
    Properties
        pr_commits
            The number of commits in this pull request
            Type: integer
            e.g., 3
        pr_changed_files
            The number of files that are changed in this pull request
            Type: integer
            e.g., 2
        pushed_commits
            The number of commits present in this push
            Type: integer
            e.g., 4
        distinct_pushed_commits
            The distinct number of commits present in this push
            Type: integer
            e.g., 1
        github_push_id
            The corresponding GitHub push ID
            Type: integer
            e.g., 11790446870, "anonymised" in the case of a human contributor

Mapping between activities and events

For many activity types, the corresponding activity can be observed by the occurrence
of a single event type. For example, the activity types Forking repository and Starring repository each require the occurrence of a single event type, as given below.

Activity type                       Event type                      Payload
Forking repository                  ForkEvent                       -
Starring repository                 WatchEvent                      action = "started"

However, in some cases, the same event type yields different activity types depending on the value present in the payload. For example, three different activity types can be generated from the same low-level event type CreateEvent, depending on the value of its ref_type (either "repository", "branch", or "tag") present in the payload.

Activity type                       Event type                      Payload
Creating repository                 CreateEvent                     ref_type = "repository"
Creating branch                     CreateEvent                     ref_type = "branch"
Creating tag                        CreateEvent                     ref_type = "tag"

In some cases, there is no one-to-one mapping between events and activities. This is because some actions on GitHub may generate more than a single event and lead to a sequence of one mandatory event and a second optional event (marked with ?). For example, for the activity type Publishing a release, the event type ReleaseEvent is mandatory with the payload's action value = "published", while the event type CreateEvent is optional, as it occurs only when a new tag is created along with the published release.

Activity type                       Event type                      Payload
Publishing a release                ReleaseEvent                    action = "published"
                                    ? CreateEvent                   ref_type = "tag"

All the identified activities along with their event type(s) and payload information are given in the following table.

Activity type                       Event type                      Payload
Creating repository                 CreateEvent                     ref_type = "repository"
Creating branch                     CreateEvent                     ref_type = "branch"
Creating tag                        CreateEvent                     ref_type = "tag"
Deleting tag                        DeleteEvent                     ref_type = "tag"
Deleting branch                     DeleteEvent                     ref_type = "branch"
Publishing a release                ReleaseEvent                    action = "published"
                                    ? CreateEvent                   ref_type = "tag"
Making repository public            PublicEvent                     -
Adding collaborator to repository   MemberEvent                     action = "added"
Forking repository                  ForkEvent                       -
Starring repository                 WatchEvent                      action = "started"
Editing wiki page                   GollumEvent                     pages-->action = "created" or "edited"
Opening issue                       IssuesEvent                     action = "opened"
Closing issue                       IssuesEvent                     action = "closed"
                                    ? IssueCommentEvent             action = "created"
Reopening issue                     IssuesEvent                     action = "reopened"
                                    ? IssueCommentEvent             action = "created"
Transferring issue                  IssuesEvent                     action = "opened"
Commenting issue                    IssueCommentEvent               action = "created"
Opening pull request                PullRequestEvent                action = "opened"
Closing pull request                PullRequestEvent                action = "closed"
                                    ? IssueCommentEvent             action = "created"
Reopening pull request              PullRequestEvent                action = "reopened"
                                    ? IssueCommentEvent             action = "created"
Commenting pull request             IssueCommentEvent               action = "created"
Commenting pull request changes     PullRequestReviewCommentEvent   action = "created"
                                    ? PullRequestReviewEvent        action = "created"
Reviewing code                      PullRequestReviewEvent          action = "created"
Commenting commits                  CommitCommentEvent              action = "created"
Pushing commits                     PushEvent                       -
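The single-event rows of the mapping can be sketched as a small lookup. This is an illustrative reading of the table, not the authors' implementation. It assumes each low-level event is a dict with "type" and "payload" keys, and the names SINGLE_EVENT_RULES and classify_event are hypothetical. Rows that need a second optional event, or whose event type and payload are shared by several activity types (for example IssueCommentEvent with action = "created"), are deliberately left out and yield None.

```python
from typing import Optional

# (event type, payload key, payload value, activity type).
# A payload key of None means the table lists no payload condition ("-").
SINGLE_EVENT_RULES = [
    ("CreateEvent", "ref_type", "repository", "Creating repository"),
    ("CreateEvent", "ref_type", "branch", "Creating branch"),
    ("CreateEvent", "ref_type", "tag", "Creating tag"),
    ("ForkEvent", None, None, "Forking repository"),
    ("WatchEvent", "action", "started", "Starring repository"),
    ("PublicEvent", None, None, "Making repository public"),
    ("MemberEvent", "action", "added", "Adding collaborator to repository"),
    ("CommitCommentEvent", "action", "created", "Commenting commits"),
    ("PushEvent", None, None, "Pushing commits"),
]


def classify_event(event: dict) -> Optional[str]:
    """Map one low-level event to an activity type, or None if not covered.

    Only unambiguous single-event rows are handled. A CreateEvent with
    ref_type = "tag" is always reported as "Creating tag" here, although
    in the full mapping it may instead be the optional second event of
    "Publishing a release".
    """
    payload = event.get("payload") or {}
    for event_type, key, value, activity in SINGLE_EVENT_RULES:
        if event.get("type") != event_type:
            continue
        if key is None or payload.get(key) == value:
            return activity
    return None
```

A call such as classify_event({"type": "WatchEvent", "payload": {"action": "started"}}) returns "Starring repository". Handling the optional second event would additionally require grouping events that belong to the same action, which this sketch does not attempt.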
Created:
2024-01-05