AISE-TUDelft/MOSAIC-agentic-3m
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/AISE-TUDelft/MOSAIC-agentic-3m
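Each configuration listed in the metadata is named `<ArtifactType>_<Source>` (for example `Comments_Claude`) and has a single `train` split. The sketch below shows one way to build those names and load a single configuration with the Hugging Face `datasets` library. The helper names (`config_name`, `all_config_names`, `load_config`) are hypothetical and not part of the dataset. The artifact and source lists cover only the configurations visible in this listing, so the full card may define more.

```python
from itertools import product

REPO_ID = "AISE-TUDelft/MOSAIC-agentic-3m"

# Artifact types and sources as they appear in the config names of this listing.
# Assumption: the full dataset card may define configs beyond these.
ARTIFACTS = ("Comments", "Commits", "Issues", "PullRequests", "Repositories")
SOURCES = ("Claude", "Codex", "Copilot", "Devin", "Human", "Jules")


def config_name(artifact, source):
    """Build a config name such as 'Comments_Claude' (hypothetical helper)."""
    if artifact not in ARTIFACTS:
        raise ValueError(f"unknown artifact type: {artifact!r}")
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    return f"{artifact}_{source}"


def all_config_names():
    """Return every artifact/source combination listed above."""
    return [config_name(a, s) for a, s in product(ARTIFACTS, SOURCES)]


def load_config(artifact, source):
    """Load the 'train' split of one config.

    Requires the `datasets` package and network access; the import is kept
    local so the name helpers above work without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(artifact, source), split="train")
```

For example, `load_config("Commits", "Human")` would request the `Commits_Human` configuration. Rows in the per-artifact configs carry a `pr_id` column, which appears intended for joining against the `id` column of the matching `PullRequests_*` config.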
Resource description:
---
license: gpl-3.0
dataset_info:
- config_name: Comments_Claude
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: minimized_reason
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 55857538
num_examples: 22329
download_size: 20512270
dataset_size: 55857538
- config_name: Comments_Codex
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: minimized_reason
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 4620303
num_examples: 3693
download_size: 1321158
dataset_size: 4620303
- config_name: Comments_Copilot
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: minimized_reason
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 32991018
num_examples: 26664
download_size: 10981731
dataset_size: 32991018
- config_name: Comments_Devin
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: minimized_reason
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 25809640
num_examples: 27518
download_size: 5950117
dataset_size: 25809640
- config_name: Comments_Human
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: minimized_reason
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 23905765
num_examples: 18559
download_size: 8273823
dataset_size: 23905765
- config_name: Comments_Jules
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: minimized_reason
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 6179000
num_examples: 5700
download_size: 1741424
dataset_size: 6179000
- config_name: Commits_Claude
features:
- name: id
dtype: string
- name: sha
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: committed_date
dtype: string
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: authored_date
dtype: string
- name: message_body
dtype: string
- name: message_headline
dtype: string
- name: author_count
dtype: int64
- name: committer
struct:
- name: email
dtype: string
- name: name
dtype: string
- name: changed_files
dtype: int64
- name: authors
list:
- name: email
dtype: string
- name: name
dtype: string
splits:
- name: train
num_bytes: 78328751
num_examples: 82755
download_size: 34360149
dataset_size: 78328751
- config_name: Commits_Codex
features:
- name: id
dtype: string
- name: sha
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: committed_date
dtype: string
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: authored_date
dtype: string
- name: message_body
dtype: string
- name: message_headline
dtype: string
- name: author_count
dtype: int64
- name: committer
struct:
- name: email
dtype: string
- name: name
dtype: string
- name: changed_files
dtype: int64
- name: authors
list:
- name: email
dtype: string
- name: name
dtype: string
splits:
- name: train
num_bytes: 13017738
num_examples: 27530
download_size: 6457855
dataset_size: 13017738
- config_name: Commits_Copilot
features:
- name: id
dtype: string
- name: sha
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: committed_date
dtype: string
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: authored_date
dtype: string
- name: message_body
dtype: string
- name: message_headline
dtype: string
- name: author_count
dtype: int64
- name: committer
struct:
- name: email
dtype: string
- name: name
dtype: string
- name: changed_files
dtype: int64
- name: authors
list:
- name: email
dtype: string
- name: name
dtype: string
splits:
- name: train
num_bytes: 41974158
num_examples: 69896
download_size: 14679965
dataset_size: 41974158
- config_name: Commits_Devin
features:
- name: id
dtype: string
- name: sha
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: committed_date
dtype: string
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: authored_date
dtype: string
- name: message_body
dtype: string
- name: message_headline
dtype: string
- name: author_count
dtype: int64
- name: committer
struct:
- name: email
dtype: string
- name: name
dtype: string
- name: changed_files
dtype: int64
- name: authors
list:
- name: email
dtype: string
- name: name
dtype: string
splits:
- name: train
num_bytes: 45600275
num_examples: 51641
download_size: 17402189
dataset_size: 45600275
- config_name: Commits_Human
features:
- name: id
dtype: string
- name: sha
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: committed_date
dtype: string
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: authored_date
dtype: string
- name: message_body
dtype: string
- name: message_headline
dtype: string
- name: author_count
dtype: int64
- name: committer
struct:
- name: email
dtype: string
- name: name
dtype: string
- name: changed_files
dtype: int64
- name: authors
list:
- name: email
dtype: string
- name: name
dtype: string
splits:
- name: train
num_bytes: 54514575
num_examples: 102037
download_size: 22855222
dataset_size: 54514575
- config_name: Commits_Jules
features:
- name: id
dtype: string
- name: sha
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: committed_date
dtype: string
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: authored_date
dtype: string
- name: message_body
dtype: string
- name: message_headline
dtype: string
- name: author_count
dtype: int64
- name: committer
struct:
- name: email
dtype: string
- name: name
dtype: string
- name: changed_files
dtype: int64
- name: authors
list:
- name: email
dtype: string
- name: name
dtype: string
splits:
- name: train
num_bytes: 39445671
num_examples: 41032
download_size: 16332003
dataset_size: 39445671
- config_name: Issues_Claude
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: locked
dtype: bool
- name: number
dtype: int64
- name: state
dtype: string
- name: tracked_issues_count
dtype: int64
- name: label_count
dtype: int64
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: issue_type
struct:
- name: description
dtype: string
- name: name
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: state_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: pr_ids
dtype: 'null'
- name: prs_closing_issue
dtype: int64
splits:
- name: train
num_bytes: 8371776
num_examples: 4052
download_size: 3940490
dataset_size: 8371776
- config_name: Issues_Codex
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: locked
dtype: bool
- name: number
dtype: int64
- name: state
dtype: string
- name: tracked_issues_count
dtype: 'null'
- name: label_count
dtype: int64
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: issue_type
struct:
- name: description
dtype: string
- name: name
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: state_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: pr_ids
dtype: 'null'
- name: prs_closing_issue
dtype: int64
splits:
- name: train
num_bytes: 57016
num_examples: 45
download_size: 42463
dataset_size: 57016
- config_name: Issues_Copilot
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: locked
dtype: bool
- name: number
dtype: int64
- name: state
dtype: string
- name: tracked_issues_count
dtype: int64
- name: label_count
dtype: int64
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: issue_type
struct:
- name: description
dtype: string
- name: name
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: state_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: pr_ids
dtype: 'null'
- name: prs_closing_issue
dtype: int64
splits:
- name: train
num_bytes: 18040689
num_examples: 9744
download_size: 7358053
dataset_size: 18040689
- config_name: Issues_Devin
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: locked
dtype: 'null'
- name: number
dtype: int64
- name: state
dtype: string
- name: tracked_issues_count
dtype: 'null'
- name: label_count
dtype: int64
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: issue_type
struct:
- name: description
dtype: string
- name: name
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: state_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: pr_ids
dtype: 'null'
- name: prs_closing_issue
dtype: int64
splits:
- name: train
num_bytes: 502230
num_examples: 294
download_size: 244261
dataset_size: 502230
- config_name: Issues_Human
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: locked
dtype: bool
- name: number
dtype: int64
- name: state
dtype: string
- name: tracked_issues_count
dtype: int64
- name: label_count
dtype: int64
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: issue_type
struct:
- name: description
dtype: string
- name: name
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: state_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: pr_ids
dtype: 'null'
- name: prs_closing_issue
dtype: int64
splits:
- name: train
num_bytes: 2590797
num_examples: 1973
download_size: 1244937
dataset_size: 2590797
- config_name: Issues_Jules
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: locked
dtype: bool
- name: number
dtype: int64
- name: state
dtype: string
- name: tracked_issues_count
dtype: 'null'
- name: label_count
dtype: int64
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: updated_at
dtype: string
- name: issue_type
struct:
- name: description
dtype: string
- name: name
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: state_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: pr_ids
dtype: 'null'
- name: prs_closing_issue
dtype: int64
splits:
- name: train
num_bytes: 4430669
num_examples: 2185
download_size: 1204850
dataset_size: 4430669
- config_name: PullRequests_Claude
features:
- name: id
dtype: string
- name: title
dtype: string
- name: url
dtype: string
- name: number
dtype: int64
- name: body
dtype: string
- name: state
dtype: string
- name: created_at
dtype: string
- name: is_draft
dtype: bool
- name: changed_files
dtype: int64
- name: is_cross_repository
dtype: bool
- name: locked
dtype: bool
- name: is_in_merge_queue
dtype: 'null'
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: label_count
dtype: int64
- name: base_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: head_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: timeline_count
dtype: int64
- name: merged_at
dtype: string
- name: closed_at
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: review_decision
dtype: string
- name: head_ref_name
dtype: string
- name: head_ref_oid
dtype: string
- name: timeline_items
dtype: 'null'
- name: base_ref_name
dtype: string
- name: base_ref_oid
dtype: string
- name: comments_count
dtype: int64
- name: reviews_count
dtype: int64
- name: commits_count
dtype: int64
- name: files
list:
- name: additions
dtype: int64
- name: change_type
dtype: string
- name: deletions
dtype: int64
- name: path
dtype: string
- name: assignees_count
dtype: int64
- name: closing_issues_count
dtype: int64
- name: author_association
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: active_lock_reason
dtype: string
splits:
- name: train
num_bytes: 60914013
num_examples: 19148
download_size: 25317639
dataset_size: 60914013
- config_name: PullRequests_Codex
features:
- name: id
dtype: string
- name: title
dtype: string
- name: url
dtype: string
- name: number
dtype: int64
- name: body
dtype: string
- name: state
dtype: string
- name: created_at
dtype: string
- name: is_draft
dtype: bool
- name: changed_files
dtype: int64
- name: is_cross_repository
dtype: bool
- name: locked
dtype: bool
- name: is_in_merge_queue
dtype: 'null'
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: label_count
dtype: int64
- name: base_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: head_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: timeline_count
dtype: int64
- name: merged_at
dtype: string
- name: closed_at
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: review_decision
dtype: string
- name: head_ref_name
dtype: string
- name: head_ref_oid
dtype: string
- name: timeline_items
dtype: 'null'
- name: base_ref_name
dtype: string
- name: base_ref_oid
dtype: string
- name: comments_count
dtype: int64
- name: reviews_count
dtype: int64
- name: commits_count
dtype: int64
- name: files
list:
- name: additions
dtype: int64
- name: change_type
dtype: string
- name: deletions
dtype: int64
- name: path
dtype: string
- name: assignees_count
dtype: int64
- name: closing_issues_count
dtype: int64
- name: author_association
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: active_lock_reason
dtype: string
splits:
- name: train
num_bytes: 27512018
num_examples: 20835
download_size: 10591468
dataset_size: 27512018
- config_name: PullRequests_Copilot
features:
- name: id
dtype: string
- name: title
dtype: string
- name: url
dtype: string
- name: number
dtype: int64
- name: body
dtype: string
- name: state
dtype: string
- name: created_at
dtype: string
- name: is_draft
dtype: bool
- name: changed_files
dtype: int64
- name: is_cross_repository
dtype: bool
- name: locked
dtype: bool
- name: is_in_merge_queue
dtype: 'null'
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: label_count
dtype: int64
- name: base_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: head_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: timeline_count
dtype: int64
- name: merged_at
dtype: string
- name: closed_at
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: review_decision
dtype: string
- name: head_ref_name
dtype: string
- name: head_ref_oid
dtype: string
- name: timeline_items
dtype: 'null'
- name: base_ref_name
dtype: string
- name: base_ref_oid
dtype: string
- name: comments_count
dtype: int64
- name: reviews_count
dtype: int64
- name: commits_count
dtype: int64
- name: files
list:
- name: additions
dtype: int64
- name: change_type
dtype: string
- name: deletions
dtype: int64
- name: path
dtype: string
- name: assignees_count
dtype: int64
- name: closing_issues_count
dtype: int64
- name: author_association
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: active_lock_reason
dtype: string
splits:
- name: train
num_bytes: 82066535
num_examples: 18563
download_size: 35017107
dataset_size: 82066535
- config_name: PullRequests_Devin
features:
- name: id
dtype: string
- name: title
dtype: string
- name: url
dtype: string
- name: number
dtype: int64
- name: body
dtype: string
- name: state
dtype: string
- name: created_at
dtype: string
- name: is_draft
dtype: bool
- name: changed_files
dtype: int64
- name: is_cross_repository
dtype: 'null'
- name: locked
dtype: bool
- name: is_in_merge_queue
dtype: 'null'
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: label_count
dtype: int64
- name: base_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: head_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: timeline_count
dtype: int64
- name: merged_at
dtype: string
- name: closed_at
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: review_decision
dtype: string
- name: head_ref_name
dtype: string
- name: head_ref_oid
dtype: string
- name: timeline_items
dtype: 'null'
- name: base_ref_name
dtype: string
- name: base_ref_oid
dtype: string
- name: comments_count
dtype: int64
- name: reviews_count
dtype: int64
- name: commits_count
dtype: int64
- name: files
list:
- name: additions
dtype: int64
- name: change_type
dtype: string
- name: deletions
dtype: int64
- name: path
dtype: string
- name: assignees_count
dtype: int64
- name: closing_issues_count
dtype: int64
- name: author_association
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: active_lock_reason
dtype: string
splits:
- name: train
num_bytes: 63944576
num_examples: 14045
download_size: 25737087
dataset_size: 63944576
- config_name: PullRequests_Human
features:
- name: id
dtype: string
- name: title
dtype: string
- name: url
dtype: string
- name: number
dtype: int64
- name: body
dtype: string
- name: state
dtype: string
- name: created_at
dtype: string
- name: is_draft
dtype: bool
- name: changed_files
dtype: int64
- name: is_cross_repository
dtype: bool
- name: locked
dtype: bool
- name: is_in_merge_queue
dtype: 'null'
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: label_count
dtype: int64
- name: base_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: head_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: timeline_count
dtype: int64
- name: merged_at
dtype: string
- name: closed_at
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: review_decision
dtype: string
- name: head_ref_name
dtype: string
- name: head_ref_oid
dtype: string
- name: timeline_items
dtype: 'null'
- name: base_ref_name
dtype: string
- name: base_ref_oid
dtype: string
- name: comments_count
dtype: int64
- name: reviews_count
dtype: int64
- name: commits_count
dtype: int64
- name: files
list:
- name: additions
dtype: int64
- name: change_type
dtype: string
- name: deletions
dtype: int64
- name: path
dtype: string
- name: assignees_count
dtype: int64
- name: closing_issues_count
dtype: int64
- name: author_association
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: active_lock_reason
dtype: string
splits:
- name: train
num_bytes: 46594555
num_examples: 20910
download_size: 18790055
dataset_size: 46594555
- config_name: PullRequests_Jules
features:
- name: id
dtype: string
- name: title
dtype: string
- name: url
dtype: string
- name: number
dtype: int64
- name: body
dtype: string
- name: state
dtype: string
- name: created_at
dtype: string
- name: is_draft
dtype: bool
- name: changed_files
dtype: int64
- name: is_cross_repository
dtype: 'null'
- name: locked
dtype: bool
- name: is_in_merge_queue
dtype: 'null'
- name: additions
dtype: int64
- name: deletions
dtype: int64
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: label_count
dtype: int64
- name: base_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: head_repository
struct:
- name: id
dtype: string
- name: name
dtype: string
- name: url
dtype: string
- name: timeline_count
dtype: int64
- name: merged_at
dtype: string
- name: closed_at
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: review_decision
dtype: string
- name: head_ref_name
dtype: string
- name: head_ref_oid
dtype: string
- name: timeline_items
dtype: 'null'
- name: base_ref_name
dtype: string
- name: base_ref_oid
dtype: string
- name: comments_count
dtype: int64
- name: reviews_count
dtype: int64
- name: commits_count
dtype: int64
- name: files
list:
- name: additions
dtype: int64
- name: change_type
dtype: string
- name: deletions
dtype: int64
- name: path
dtype: string
- name: assignees_count
dtype: int64
- name: closing_issues_count
dtype: int64
- name: author_association
dtype: string
- name: labels
list:
- name: description
dtype: string
- name: name
dtype: string
- name: active_lock_reason
dtype: string
splits:
- name: train
num_bytes: 28962080
num_examples: 18468
download_size: 9758320
dataset_size: 28962080
- config_name: Repositories_Claude
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: role
dtype: string
- name: name
dtype: string
- name: name_with_owner
dtype: string
- name: url
dtype: string
- name: ssh_url
dtype: string
- name: stargazer_count
dtype: int64
- name: is_fork
dtype: bool
- name: is_archived
dtype: bool
- name: is_disabled
dtype: 'null'
- name: is_empty
dtype: 'null'
- name: is_in_organization
dtype: bool
- name: is_locked
dtype: 'null'
- name: is_private
dtype: 'null'
- name: is_mirror
dtype: 'null'
- name: is_template
dtype: bool
- name: is_user_configuration_repository
dtype: bool
- name: fork_count
dtype: int64
- name: forking_allowed
dtype: bool
- name: created_at
dtype: string
- name: visibility
dtype: string
- name: owner
struct:
- name: id
dtype: string
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: topics_count
dtype: int64
- name: languages
list: string
- name: language_count
dtype: int64
- name: watchers
dtype: int64
- name: license_info
dtype: string
- name: default_brach
dtype: string
- name: repository_topics
list: string
- name: primary_language
dtype: string
- name: lock_reason
dtype: 'null'
- name: pushed_at
dtype: string
- name: updated_at
dtype: string
- name: archived_at
dtype: string
- name: description
dtype: string
splits:
- name: train
num_bytes: 20426354
num_examples: 38260
download_size: 2499442
dataset_size: 20426354
- config_name: Repositories_Codex
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: role
dtype: string
- name: name
dtype: string
- name: name_with_owner
dtype: string
- name: url
dtype: string
- name: ssh_url
dtype: string
- name: stargazer_count
dtype: int64
- name: is_fork
dtype: bool
- name: is_archived
dtype: bool
- name: is_disabled
dtype: 'null'
- name: is_empty
dtype: 'null'
- name: is_in_organization
dtype: bool
- name: is_locked
dtype: 'null'
- name: is_private
dtype: 'null'
- name: is_mirror
dtype: 'null'
- name: is_template
dtype: bool
- name: is_user_configuration_repository
dtype: bool
- name: fork_count
dtype: int64
- name: forking_allowed
dtype: bool
- name: created_at
dtype: string
- name: visibility
dtype: string
- name: owner
struct:
- name: id
dtype: string
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: topics_count
dtype: int64
- name: languages
list: string
- name: language_count
dtype: int64
- name: watchers
dtype: int64
- name: license_info
dtype: string
- name: default_brach
dtype: string
- name: repository_topics
list: string
- name: primary_language
dtype: string
- name: lock_reason
dtype: 'null'
- name: pushed_at
dtype: string
- name: updated_at
dtype: string
- name: archived_at
dtype: string
- name: description
dtype: string
splits:
- name: train
num_bytes: 20955633
num_examples: 41669
download_size: 2882232
dataset_size: 20955633
- config_name: Repositories_Copilot
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: role
dtype: string
- name: name
dtype: string
- name: name_with_owner
dtype: string
- name: url
dtype: string
- name: ssh_url
dtype: string
- name: stargazer_count
dtype: int64
- name: is_fork
dtype: bool
- name: is_archived
dtype: bool
- name: is_disabled
dtype: 'null'
- name: is_empty
dtype: 'null'
- name: is_in_organization
dtype: bool
- name: is_locked
dtype: 'null'
- name: is_private
dtype: 'null'
- name: is_mirror
dtype: 'null'
- name: is_template
dtype: bool
- name: is_user_configuration_repository
dtype: bool
- name: fork_count
dtype: int64
- name: forking_allowed
dtype: bool
- name: created_at
dtype: string
- name: visibility
dtype: string
- name: owner
struct:
- name: id
dtype: string
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: topics_count
dtype: int64
- name: languages
list: string
- name: language_count
dtype: int64
- name: watchers
dtype: int64
- name: license_info
dtype: string
- name: default_brach
dtype: string
- name: repository_topics
list: string
- name: primary_language
dtype: string
- name: lock_reason
dtype: 'null'
- name: pushed_at
dtype: string
- name: updated_at
dtype: string
- name: archived_at
dtype: string
- name: description
dtype: string
splits:
- name: train
num_bytes: 19802485
num_examples: 37125
download_size: 2756730
dataset_size: 19802485
- config_name: Repositories_Devin
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: role
dtype: string
- name: name
dtype: string
- name: name_with_owner
dtype: string
- name: url
dtype: string
- name: ssh_url
dtype: string
- name: stargazer_count
dtype: int64
- name: is_fork
dtype: bool
- name: is_archived
dtype: bool
- name: is_disabled
dtype: 'null'
- name: is_empty
dtype: 'null'
- name: is_in_organization
dtype: bool
- name: is_locked
dtype: 'null'
- name: is_private
dtype: 'null'
- name: is_mirror
dtype: 'null'
- name: is_template
dtype: bool
- name: is_user_configuration_repository
dtype: bool
- name: fork_count
dtype: int64
- name: forking_allowed
dtype: bool
- name: created_at
dtype: string
- name: visibility
dtype: string
- name: owner
struct:
- name: id
dtype: string
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: topics_count
dtype: int64
- name: languages
list: string
- name: language_count
dtype: int64
- name: watchers
dtype: int64
- name: license_info
dtype: string
- name: default_brach
dtype: string
- name: repository_topics
list: string
- name: primary_language
dtype: string
- name: lock_reason
dtype: 'null'
- name: pushed_at
dtype: string
- name: updated_at
dtype: string
- name: archived_at
dtype: string
- name: description
dtype: string
splits:
- name: train
num_bytes: 14911226
num_examples: 28090
download_size: 1237029
dataset_size: 14911226
- config_name: Repositories_Human
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: role
dtype: string
- name: name
dtype: string
- name: name_with_owner
dtype: string
- name: url
dtype: string
- name: ssh_url
dtype: string
- name: stargazer_count
dtype: int64
- name: is_fork
dtype: bool
- name: is_archived
dtype: bool
- name: is_disabled
dtype: 'null'
- name: is_empty
dtype: 'null'
- name: is_in_organization
dtype: bool
- name: is_locked
dtype: 'null'
- name: is_private
dtype: 'null'
- name: is_mirror
dtype: 'null'
- name: is_template
dtype: bool
- name: is_user_configuration_repository
dtype: bool
- name: fork_count
dtype: int64
- name: forking_allowed
dtype: bool
- name: created_at
dtype: string
- name: visibility
dtype: string
- name: owner
struct:
- name: id
dtype: string
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: topics_count
dtype: int64
- name: languages
list: string
- name: language_count
dtype: int64
- name: watchers
dtype: int64
- name: license_info
dtype: string
- name: default_brach
dtype: string
- name: repository_topics
list: string
- name: primary_language
dtype: string
- name: lock_reason
dtype: 'null'
- name: pushed_at
dtype: string
- name: updated_at
dtype: string
- name: archived_at
dtype: string
- name: description
dtype: string
splits:
- name: train
num_bytes: 22804151
num_examples: 41542
download_size: 5429443
dataset_size: 22804151
- config_name: Repositories_Jules
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: role
dtype: string
- name: name
dtype: string
- name: name_with_owner
dtype: string
- name: url
dtype: string
- name: ssh_url
dtype: string
- name: stargazer_count
dtype: int64
- name: is_fork
dtype: bool
- name: is_archived
dtype: bool
- name: is_disabled
dtype: 'null'
- name: is_empty
dtype: 'null'
- name: is_in_organization
dtype: bool
- name: is_locked
dtype: 'null'
- name: is_private
dtype: 'null'
- name: is_mirror
dtype: 'null'
- name: is_template
dtype: bool
- name: is_user_configuration_repository
dtype: bool
- name: fork_count
dtype: int64
- name: forking_allowed
dtype: bool
- name: created_at
dtype: string
- name: visibility
dtype: string
- name: owner
struct:
- name: id
dtype: string
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
- name: topics_count
dtype: int64
- name: languages
list: string
- name: language_count
dtype: int64
- name: watchers
dtype: int64
- name: license_info
dtype: string
- name: default_brach
dtype: string
- name: repository_topics
list: string
- name: primary_language
dtype: string
- name: lock_reason
dtype: 'null'
- name: pushed_at
dtype: string
- name: updated_at
dtype: string
- name: archived_at
dtype: string
- name: description
dtype: string
splits:
- name: train
num_bytes: 18591214
num_examples: 36936
download_size: 1962081
dataset_size: 18591214
- config_name: Reviews_Claude
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: state
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: submitted_at
dtype: string
- name: minimized_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 46894755
num_examples: 12728
download_size: 17078595
dataset_size: 46894755
- config_name: Reviews_Codex
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: state
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: submitted_at
dtype: string
- name: minimized_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 3656182
num_examples: 1957
download_size: 1271022
dataset_size: 3656182
- config_name: Reviews_Copilot
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: state
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: submitted_at
dtype: string
- name: minimized_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 13185210
num_examples: 20665
download_size: 4235301
dataset_size: 13185210
- config_name: Reviews_Devin
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: state
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: submitted_at
dtype: string
- name: minimized_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 5296907
num_examples: 6901
download_size: 1598524
dataset_size: 5296907
- config_name: Reviews_Human
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: state
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: submitted_at
dtype: string
- name: minimized_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 17970167
num_examples: 21401
download_size: 6705393
dataset_size: 17970167
- config_name: Reviews_Jules
features:
- name: id
dtype: string
- name: pr_id
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: created_at
dtype: string
- name: is_minimized
dtype: bool
- name: state
dtype: string
- name: updated_at
dtype: string
- name: last_edited_at
dtype: string
- name: published_at
dtype: string
- name: submitted_at
dtype: string
- name: minimized_reason
dtype: string
- name: author
struct:
- name: id
dtype: 'null'
- name: login
dtype: string
- name: name
dtype: 'null'
- name: typename
dtype: string
- name: url
dtype: string
splits:
- name: train
num_bytes: 2724226
num_examples: 3249
download_size: 1035607
dataset_size: 2724226
configs:
- config_name: Comments_Claude
data_files:
- split: train
path: data/Claude/Comments/train-*
- config_name: Comments_Codex
data_files:
- split: train
path: data/Codex/Comments/train-*
- config_name: Comments_Copilot
data_files:
- split: train
path: data/Copilot/Comments/train-*
- config_name: Comments_Devin
data_files:
- split: train
path: data/Devin/Comments/train-*
- config_name: Comments_Human
data_files:
- split: train
path: data/Human/Comments/train-*
- config_name: Comments_Jules
data_files:
- split: train
path: data/Jules/Comments/train-*
- config_name: Commits_Claude
data_files:
- split: train
path: data/Claude/Commits/train-*
- config_name: Commits_Codex
data_files:
- split: train
path: data/Codex/Commits/train-*
- config_name: Commits_Copilot
data_files:
- split: train
path: data/Copilot/Commits/train-*
- config_name: Commits_Devin
data_files:
- split: train
path: data/Devin/Commits/train-*
- config_name: Commits_Human
data_files:
- split: train
path: data/Human/Commits/train-*
- config_name: Commits_Jules
data_files:
- split: train
path: data/Jules/Commits/train-*
- config_name: Issues_Claude
data_files:
- split: train
path: data/Claude/Issues/train-*
- config_name: Issues_Codex
data_files:
- split: train
path: data/Codex/Issues/train-*
- config_name: Issues_Copilot
data_files:
- split: train
path: data/Copilot/Issues/train-*
- config_name: Issues_Devin
data_files:
- split: train
path: data/Devin/Issues/train-*
- config_name: Issues_Human
data_files:
- split: train
path: data/Human/Issues/train-*
- config_name: Issues_Jules
data_files:
- split: train
path: data/Jules/Issues/train-*
- config_name: PullRequests_Claude
data_files:
- split: train
path: data/Claude/PullRequests/train-*
- config_name: PullRequests_Codex
data_files:
- split: train
path: data/Codex/PullRequests/train-*
- config_name: PullRequests_Copilot
data_files:
- split: train
path: data/Copilot/PullRequests/train-*
- config_name: PullRequests_Devin
data_files:
- split: train
path: data/Devin/PullRequests/train-*
- config_name: PullRequests_Human
data_files:
- split: train
path: data/Human/PullRequests/train-*
- config_name: PullRequests_Jules
data_files:
- split: train
path: data/Jules/PullRequests/train-*
- config_name: Repositories_Claude
data_files:
- split: train
path: data/Claude/Repositories/train-*
- config_name: Repositories_Codex
data_files:
- split: train
path: data/Codex/Repositories/train-*
- config_name: Repositories_Copilot
data_files:
- split: train
path: data/Copilot/Repositories/train-*
- config_name: Repositories_Devin
data_files:
- split: train
path: data/Devin/Repositories/train-*
- config_name: Repositories_Human
data_files:
- split: train
path: data/Human/Repositories/train-*
- config_name: Repositories_Jules
data_files:
- split: train
path: data/Jules/Repositories/train-*
- config_name: Reviews_Claude
data_files:
- split: train
path: data/Claude/Reviews/train-*
- config_name: Reviews_Codex
data_files:
- split: train
path: data/Codex/Reviews/train-*
- config_name: Reviews_Copilot
data_files:
- split: train
path: data/Copilot/Reviews/train-*
- config_name: Reviews_Devin
data_files:
- split: train
path: data/Devin/Reviews/train-*
- config_name: Reviews_Human
data_files:
- split: train
path: data/Human/Reviews/train-*
- config_name: Reviews_Jules
data_files:
- split: train
path: data/Jules/Reviews/train-*
---
# Agent Activity Dataset
This dataset is released in conjunction with the paper **Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time**, accepted at [**MSR 2026**](https://2026.msrconf.org/details/msr-2026-technical-papers/21/Investigating-Autonomous-Agent-Contributions-in-the-Wild-Activity-Patterns-and-Code-).
## Dataset Overview
The dataset contains a total of **111,969** Pull Requests **(June through August 2025)** from both coding agents (**Claude Code**, **OpenAI Codex**, **GitHub Copilot**, **Google Jules**, and **Devin**) and **human** contributors. It also includes additional activity metadata such as repositories, commits, comments, issues, reviews, and modified files. A summary of the dataset is presented below.
| PR Author | #PR | #Repository | #Commit | #Comment | #Review | #Issue | #Changed File |
|----------------|--------|-------------|---------|----------|---------|--------|---------------|
| OpenAI Codex | 20,835 | 41,669 | 27,530 | 3,693 | 1,957 | 45 | 90,822 |
| Claude Code | 19,148 | 38,260 | 82,755 | 22,329 | 12,728 | 4,052 | 255,275 |
| GitHub Copilot | 18,563 | 37,125 | 69,896 | 26,664 | 20,665 | 9,744 | 158,404 |
| Google Jules | 18,468 | 36,936 | 41,032 | 5,700 | 3,249 | 2,185 | 138,610 |
| Devin | 14,045 | 28,090 | 51,641 | 27,518 | 6,901 | 294 | 131,454 |
| Human | 20,910 | 41,542 | 102,037 | 18,559 | 21,401 | 1,973 | 194,861 |
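As a quick sanity check on the summary table, the sketch below re-adds the per-author PR counts and derives an average number of commits per PR. The dictionary simply copies figures from the table; the variable names are illustrative and not part of the dataset.

```python
# Per-author counts copied from the summary table above.
summary = {
    "OpenAI Codex":   {"prs": 20_835, "commits": 27_530},
    "Claude Code":    {"prs": 19_148, "commits": 82_755},
    "GitHub Copilot": {"prs": 18_563, "commits": 69_896},
    "Google Jules":   {"prs": 18_468, "commits": 41_032},
    "Devin":          {"prs": 14_045, "commits": 51_641},
    "Human":          {"prs": 20_910, "commits": 102_037},
}

# The per-author PR counts should add up to the total quoted in the overview.
total_prs = sum(row["prs"] for row in summary.values())
print(total_prs)  # 111969

# Average commits per PR, rounded to two decimals, for a rough comparison.
commits_per_pr = {
    author: round(row["commits"] / row["prs"], 2)
    for author, row in summary.items()
}
for author, ratio in commits_per_pr.items():
    print(f"{author}: {ratio}")
```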
## Dataset Structure
The schema of the dataset is shown below. Solid lines indicate entities, while dotted lines represent nested objects.

- **Pull Request**: records the content, state, and activity of a pull request, including author, repository references, timestamps, and total number of commits, reviews, comments, closed issues, labels, and files changed.
- **Repository**: stores a repository's ownership, visibility, status flags, popularity metrics, programming languages, topics, licensing, timestamps, and descriptive information.
- **Commit**: captures a commit's identity, content, timestamps, authoring and committing information, changed files, and associated authors for a given pull request.
- **Review**: lists a pull request review, including its identifier, author, content, state, timestamps, and minimization status.
- **Comment**: represents a pull request comment with its identifier, author, content, timestamps, publication status, and minimization details.
- **Issue**: stores information about an issue linked to a pull request, including its identifier, author, title, description, state, timestamps, type, labels, and other associated PRs.
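Review, comment, and repository records each carry a `pr_id` field, so they can be grouped by the pull request they belong to. The following is a minimal sketch of that grouping using a few hypothetical in-memory rows; only the `pr_id` and `state` field names come from the Reviews schema, while the identifiers and state values shown are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows: only the `pr_id` and `state` keys mirror the Reviews schema.
# The identifier strings and state values are made up for this example.
reviews = [
    {"pr_id": "PR_1", "state": "APPROVED"},
    {"pr_id": "PR_1", "state": "COMMENTED"},
    {"pr_id": "PR_2", "state": "CHANGES_REQUESTED"},
]


def review_states_by_pr(rows):
    """Count review states per pull request, keyed by `pr_id`."""
    states = defaultdict(Counter)
    for row in rows:
        states[row["pr_id"]][row["state"]] += 1
    return states


by_pr = review_states_by_pr(reviews)
for pr_id, counts in by_pr.items():
    print(pr_id, dict(counts))
```

The same function should work on rows iterated from a loaded `Reviews_*` configuration, since a `datasets.Dataset` yields one dictionary per row.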
## Dataset Usage
The example below loads each entity (*pull requests*, *repositories*, *commits*, *comments*, *reviews*, and *issues*) by configuration name for *Claude*. The same pattern applies to the other PR authors: replace the `Claude` suffix with *Codex*, *Copilot*, *Devin*, *Jules*, or *Human*.
```python
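from datasets import load_dataset

# Each configuration is named <Entity>_<Author> and has a single 'train' split.
claude_pullrequests = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'PullRequests_Claude', split='train')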
claude_repositories = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Repositories_Claude', split='train')
claude_commits = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Commits_Claude', split='train')
claude_comments = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Comments_Claude', split='train')
claude_reviews = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Reviews_Claude', split='train')
claude_issues = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Issues_Claude', split='train')
```
The example below loads the same entities by data directory for **Claude**. The same pattern applies to the other PR authors: replace `Claude` in the path with the directory name *Codex*, *Copilot*, *Devin*, *Jules*, or *Human*.
```python
from datasets import load_dataset

# Data directories follow the layout data/<Author>/<Entity>.
claude_pullrequests = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/PullRequests', split='train')
claude_repositories = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Repositories', split='train')
claude_commits = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Commits', split='train')
claude_comments = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Comments', split='train')
claude_reviews = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Reviews', split='train')
claude_issues = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Issues', split='train')
```
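Because the configuration names and data directories follow a regular pattern, they can be generated rather than typed out. The sketch below builds both from the entity and author names listed in the card metadata; the helper names (`config_name`, `data_dir`, `load_entity_for_all_authors`) are hypothetical and not part of the dataset or the `datasets` library.

```python
REPO_ID = "AISE-TUDelft/MOSAIC-agentic-3m"
ENTITIES = ["PullRequests", "Repositories", "Commits", "Comments", "Reviews", "Issues"]
AUTHORS = ["Claude", "Codex", "Copilot", "Devin", "Jules", "Human"]


def config_name(entity, author):
    """Configuration name, e.g. 'Reviews_Claude'."""
    return f"{entity}_{author}"


def data_dir(entity, author):
    """Data directory inside the repository, e.g. 'data/Claude/Reviews'."""
    return f"data/{author}/{entity}"


def load_entity_for_all_authors(entity):
    """Load one entity type for every author (downloads data when called)."""
    # Imported here so the name helpers above work without `datasets` installed.
    from datasets import load_dataset

    return {
        author: load_dataset(REPO_ID, config_name(entity, author), split="train")
        for author in AUTHORS
    }


all_configs = [config_name(e, a) for e in ENTITIES for a in AUTHORS]
print(len(all_configs))  # 36
```

Calling `load_entity_for_all_authors("Reviews")` would then return one dataset per author, which is convenient for side-by-side comparisons.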
Provider:
AISE-TUDelft
Collected Summary
Dataset Introduction

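Construction
At the intersection of software engineering and artificial intelligence, the MOSAIC-agentic-3m dataset was built by systematically collecting collaboration data from GitHub. Its core method extracts structured information from the Pull Requests, Issues, Commits, and Comments associated with several well-known AI agents (Claude, Codex, Copilot, Devin, Jules) and with human developers. Collection relies on version control metadata, which keeps the source of every record traceable, and a unified schema standardizes the different record types, yielding a comprehensive corpus that covers multiple dimensions of software development activity.
Characteristics
The dataset's most notable feature is its fine-grained, multimodal structure: it separates human contributions from those of several AI agents and also categorizes records by stage of software development (such as code commits, issue tracking, and code review). Each configuration carries rich metadata fields, including author information, timestamps, change statistics, and linking identifiers, which give detailed context for analyzing the behavior patterns of AI-assisted programming. The differences in data volume reflect how heavily each agent participates in real projects, and this natural distribution makes the dataset more realistic and representative.
Usage
Researchers can use the dataset for many kinds of empirical analysis, for example comparing the linguistic features and behavior patterns of humans and AI agents in commit messages, issue discussions, or review comments to explore how AI affects collaborative development. The dataset can be loaded one configuration at a time, which makes it easy to run focused studies on a specific agent or activity type. Typical applications include training or evaluating code generation models, studying the dynamics of human-AI collaboration, and serving as benchmark data for detecting the characteristics of AI-generated content. Use of the dataset should follow the GPL-3.0 license, and analyses should account for the differing sample sizes across subsets to keep results robust.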
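Background and Challenges
Background Overview
Against the backdrop of rapid progress in AI agent technology, the MOSAIC-agentic-3m dataset was created to provide large-scale, fine-grained, real-world interaction data for research on agent collaboration and code generation. Built recently by the associated research institution, it focuses on the collaborative behavior patterns of multiple agents in the software development process; its core research questions concern the efficiency of communication between agents, task allocation strategies, and the quantitative evaluation of code contribution quality. By integrating multimodal data such as pull requests, commit records, issue reports, and comments from Claude, Codex, Copilot, Devin, Jules, and human developers, the dataset lays an empirical foundation for understanding the autonomy and adaptability of agents in complex engineering environments, and it has notable influence on research into intelligent software engineering and multi-agent systems.
Current Challenges
The dataset targets the challenges of recognizing agent behavior and evaluating agent performance in collaborative software development. These include distinguishing the stylistic features of content generated by different agents, quantifying the quality and novelty of the code agents contribute, and modeling the dynamic complexity of multi-agent interaction. Construction faced serious data collection and cleaning challenges: extracting large volumes of heterogeneous data from platforms such as GitHub requires handling privacy and licensing issues and ensuring accurate, consistent annotation, while also overcoming differences in the output formats of different agents and the difficulty of aligning timestamps, in order to build a high-quality, reusable benchmark dataset.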
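Common Scenarios
Typical Use Cases
At the intersection of software engineering and artificial intelligence, the MOSAIC-agentic-3m dataset is a key resource for studying agent-driven code collaboration. It is typically used to train and evaluate large language models on tasks such as code review, commit message generation, and issue tracking. By integrating multi-source data from Claude, Codex, Copilot, Devin, Jules, and human developers, it lets researchers analyze in depth how agents and humans differ in their interaction patterns across the software development process, providing a data foundation for building more efficient automated programming assistants.
Academic Problems Addressed
The dataset addresses core academic problems such as the interpretability of agent behavior, the evaluation of code generation quality, and the quantification of human-AI collaboration efficiency. By providing structured, clearly labeled records of repository activity, it allows researchers to investigate systematically the decision logic and output characteristics of agents in real development environments. Its significance lies in moving beyond the limits of earlier simulated environments or small-scale datasets: it offers an empirical basis for validating the practical effectiveness of agents in complex software engineering scenarios and pushes automated programming research from theoretical exploration toward practical validation.
Derived and Related Work
Several lines of research have grown out of the dataset, for example code quality evaluation frameworks based on comparing the behavior of multiple agents, development activity prediction models that incorporate temporal information, and methods for verifying the trustworthiness of agent-generated content. This work not only deepens the understanding of agent coding behavior but has also led to tools such as automated code review assistants and commit message generators. Related research has extended further into areas such as software engineering education and open-source community governance, forming a data-driven methodology for intelligent software development.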
The content above was collected and summarized by 遇见数据集.



