
AISE-TUDelft/MOSAIC-agentic-3m

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/AISE-TUDelft/MOSAIC-agentic-3m
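The metadata on this page names each config as a record type joined to an author group (for example `Comments_Claude` or `Commits_Human`), each with a single `train` split. A minimal sketch of loading one config with the Hugging Face `datasets` library follows. The helper names `config_name`, `all_config_names`, and `load_config` are hypothetical, and the sketch assumes the `Reviews_*` configs exist for all six author groups, which the listing on this page does not fully show. To go through the mirror above, setting the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing `datasets` should work, though that is untested here.

```python
from itertools import product

# Repository id taken from the download link on this page.
REPO_ID = "AISE-TUDelft/MOSAIC-agentic-3m"

# Record types and author groups as they appear in the config names.
# Assumption: every record type exists for every author group.
RECORD_TYPES = ("Comments", "Commits", "Issues", "PullRequests", "Repositories", "Reviews")
AUTHORS = ("Claude", "Codex", "Copilot", "Devin", "Human", "Jules")


def config_name(record_type: str, author: str) -> str:
    """Build a config name such as 'Commits_Devin' (hypothetical helper)."""
    if record_type not in RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type!r}")
    if author not in AUTHORS:
        raise ValueError(f"unknown author group: {author!r}")
    return f"{record_type}_{author}"


def all_config_names() -> list[str]:
    """List every record type / author group combination."""
    return [config_name(r, a) for r, a in product(RECORD_TYPES, AUTHORS)]


def load_config(record_type: str, author: str):
    """Download the train split of one config (needs `datasets` and network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(record_type, author), split="train")
```

For example, `load_config("PullRequests", "Human").features` would show the column schema of the human-authored pull requests, which can be compared against the field list in the description.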
Resource description:
--- license: gpl-3.0 dataset_info: - config_name: Comments_Claude features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: minimized_reason dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 55857538 num_examples: 22329 download_size: 20512270 dataset_size: 55857538 - config_name: Comments_Codex features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: minimized_reason dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 4620303 num_examples: 3693 download_size: 1321158 dataset_size: 4620303 - config_name: Comments_Copilot features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: minimized_reason dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 32991018 num_examples: 26664 download_size: 10981731 dataset_size: 32991018 - config_name: Comments_Devin features: - name: id dtype: string - name: pr_id dtype: 
string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: minimized_reason dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 25809640 num_examples: 27518 download_size: 5950117 dataset_size: 25809640 - config_name: Comments_Human features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: minimized_reason dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 23905765 num_examples: 18559 download_size: 8273823 dataset_size: 23905765 - config_name: Comments_Jules features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: minimized_reason dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 6179000 num_examples: 5700 download_size: 1741424 dataset_size: 6179000 - config_name: Commits_Claude features: - name: id dtype: string - name: sha dtype: string - name: pr_id dtype: string - name: url dtype: string - name: committed_date dtype: string - name: additions dtype: int64 - 
name: deletions dtype: int64 - name: authored_date dtype: string - name: message_body dtype: string - name: message_headline dtype: string - name: author_count dtype: int64 - name: committer struct: - name: email dtype: string - name: name dtype: string - name: changed_files dtype: int64 - name: authors list: - name: email dtype: string - name: name dtype: string splits: - name: train num_bytes: 78328751 num_examples: 82755 download_size: 34360149 dataset_size: 78328751 - config_name: Commits_Codex features: - name: id dtype: string - name: sha dtype: string - name: pr_id dtype: string - name: url dtype: string - name: committed_date dtype: string - name: additions dtype: int64 - name: deletions dtype: int64 - name: authored_date dtype: string - name: message_body dtype: string - name: message_headline dtype: string - name: author_count dtype: int64 - name: committer struct: - name: email dtype: string - name: name dtype: string - name: changed_files dtype: int64 - name: authors list: - name: email dtype: string - name: name dtype: string splits: - name: train num_bytes: 13017738 num_examples: 27530 download_size: 6457855 dataset_size: 13017738 - config_name: Commits_Copilot features: - name: id dtype: string - name: sha dtype: string - name: pr_id dtype: string - name: url dtype: string - name: committed_date dtype: string - name: additions dtype: int64 - name: deletions dtype: int64 - name: authored_date dtype: string - name: message_body dtype: string - name: message_headline dtype: string - name: author_count dtype: int64 - name: committer struct: - name: email dtype: string - name: name dtype: string - name: changed_files dtype: int64 - name: authors list: - name: email dtype: string - name: name dtype: string splits: - name: train num_bytes: 41974158 num_examples: 69896 download_size: 14679965 dataset_size: 41974158 - config_name: Commits_Devin features: - name: id dtype: string - name: sha dtype: string - name: pr_id dtype: string - name: url dtype: string - 
name: committed_date dtype: string - name: additions dtype: int64 - name: deletions dtype: int64 - name: authored_date dtype: string - name: message_body dtype: string - name: message_headline dtype: string - name: author_count dtype: int64 - name: committer struct: - name: email dtype: string - name: name dtype: string - name: changed_files dtype: int64 - name: authors list: - name: email dtype: string - name: name dtype: string splits: - name: train num_bytes: 45600275 num_examples: 51641 download_size: 17402189 dataset_size: 45600275 - config_name: Commits_Human features: - name: id dtype: string - name: sha dtype: string - name: pr_id dtype: string - name: url dtype: string - name: committed_date dtype: string - name: additions dtype: int64 - name: deletions dtype: int64 - name: authored_date dtype: string - name: message_body dtype: string - name: message_headline dtype: string - name: author_count dtype: int64 - name: committer struct: - name: email dtype: string - name: name dtype: string - name: changed_files dtype: int64 - name: authors list: - name: email dtype: string - name: name dtype: string splits: - name: train num_bytes: 54514575 num_examples: 102037 download_size: 22855222 dataset_size: 54514575 - config_name: Commits_Jules features: - name: id dtype: string - name: sha dtype: string - name: pr_id dtype: string - name: url dtype: string - name: committed_date dtype: string - name: additions dtype: int64 - name: deletions dtype: int64 - name: authored_date dtype: string - name: message_body dtype: string - name: message_headline dtype: string - name: author_count dtype: int64 - name: committer struct: - name: email dtype: string - name: name dtype: string - name: changed_files dtype: int64 - name: authors list: - name: email dtype: string - name: name dtype: string splits: - name: train num_bytes: 39445671 num_examples: 41032 download_size: 16332003 dataset_size: 39445671 - config_name: Issues_Claude features: - name: id dtype: string - name: pr_id 
dtype: string - name: url dtype: string - name: title dtype: string - name: body dtype: string - name: created_at dtype: string - name: locked dtype: bool - name: number dtype: int64 - name: state dtype: string - name: tracked_issues_count dtype: int64 - name: label_count dtype: int64 - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: issue_type struct: - name: description dtype: string - name: name dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: state_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: pr_ids dtype: 'null' - name: prs_closing_issue dtype: int64 splits: - name: train num_bytes: 8371776 num_examples: 4052 download_size: 3940490 dataset_size: 8371776 - config_name: Issues_Codex features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: title dtype: string - name: body dtype: string - name: created_at dtype: string - name: locked dtype: bool - name: number dtype: int64 - name: state dtype: string - name: tracked_issues_count dtype: 'null' - name: label_count dtype: int64 - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: issue_type struct: - name: description dtype: string - name: name dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: state_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: pr_ids dtype: 'null' - name: prs_closing_issue dtype: int64 splits: - name: train num_bytes: 57016 num_examples: 45 download_size: 42463 dataset_size: 57016 - config_name: Issues_Copilot features: - name: id dtype: string - 
name: pr_id dtype: string - name: url dtype: string - name: title dtype: string - name: body dtype: string - name: created_at dtype: string - name: locked dtype: bool - name: number dtype: int64 - name: state dtype: string - name: tracked_issues_count dtype: int64 - name: label_count dtype: int64 - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: issue_type struct: - name: description dtype: string - name: name dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: state_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: pr_ids dtype: 'null' - name: prs_closing_issue dtype: int64 splits: - name: train num_bytes: 18040689 num_examples: 9744 download_size: 7358053 dataset_size: 18040689 - config_name: Issues_Devin features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: title dtype: string - name: body dtype: string - name: created_at dtype: string - name: locked dtype: 'null' - name: number dtype: int64 - name: state dtype: string - name: tracked_issues_count dtype: 'null' - name: label_count dtype: int64 - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: issue_type struct: - name: description dtype: string - name: name dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: state_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: pr_ids dtype: 'null' - name: prs_closing_issue dtype: int64 splits: - name: train num_bytes: 502230 num_examples: 294 download_size: 244261 dataset_size: 502230 - config_name: Issues_Human features: - name: id 
dtype: string - name: pr_id dtype: string - name: url dtype: string - name: title dtype: string - name: body dtype: string - name: created_at dtype: string - name: locked dtype: bool - name: number dtype: int64 - name: state dtype: string - name: tracked_issues_count dtype: int64 - name: label_count dtype: int64 - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: issue_type struct: - name: description dtype: string - name: name dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: state_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: pr_ids dtype: 'null' - name: prs_closing_issue dtype: int64 splits: - name: train num_bytes: 2590797 num_examples: 1973 download_size: 1244937 dataset_size: 2590797 - config_name: Issues_Jules features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: title dtype: string - name: body dtype: string - name: created_at dtype: string - name: locked dtype: bool - name: number dtype: int64 - name: state dtype: string - name: tracked_issues_count dtype: 'null' - name: label_count dtype: int64 - name: last_edited_at dtype: string - name: published_at dtype: string - name: updated_at dtype: string - name: issue_type struct: - name: description dtype: string - name: name dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: state_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: pr_ids dtype: 'null' - name: prs_closing_issue dtype: int64 splits: - name: train num_bytes: 4430669 num_examples: 2185 download_size: 1204850 dataset_size: 4430669 - config_name: PullRequests_Claude 
features: - name: id dtype: string - name: title dtype: string - name: url dtype: string - name: number dtype: int64 - name: body dtype: string - name: state dtype: string - name: created_at dtype: string - name: is_draft dtype: bool - name: changed_files dtype: int64 - name: is_cross_repository dtype: bool - name: locked dtype: bool - name: is_in_merge_queue dtype: 'null' - name: additions dtype: int64 - name: deletions dtype: int64 - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: label_count dtype: int64 - name: base_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: head_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: timeline_count dtype: int64 - name: merged_at dtype: string - name: closed_at dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: review_decision dtype: string - name: head_ref_name dtype: string - name: head_ref_oid dtype: string - name: timeline_items dtype: 'null' - name: base_ref_name dtype: string - name: base_ref_oid dtype: string - name: comments_count dtype: int64 - name: reviews_count dtype: int64 - name: commits_count dtype: int64 - name: files list: - name: additions dtype: int64 - name: change_type dtype: string - name: deletions dtype: int64 - name: path dtype: string - name: assignees_count dtype: int64 - name: closing_issues_count dtype: int64 - name: author_association dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: active_lock_reason dtype: string splits: - name: train num_bytes: 60914013 num_examples: 19148 download_size: 25317639 dataset_size: 60914013 - config_name: PullRequests_Codex features: - name: id dtype: string - name: title dtype: string - name: url dtype: string - 
name: number dtype: int64 - name: body dtype: string - name: state dtype: string - name: created_at dtype: string - name: is_draft dtype: bool - name: changed_files dtype: int64 - name: is_cross_repository dtype: bool - name: locked dtype: bool - name: is_in_merge_queue dtype: 'null' - name: additions dtype: int64 - name: deletions dtype: int64 - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: label_count dtype: int64 - name: base_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: head_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: timeline_count dtype: int64 - name: merged_at dtype: string - name: closed_at dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: review_decision dtype: string - name: head_ref_name dtype: string - name: head_ref_oid dtype: string - name: timeline_items dtype: 'null' - name: base_ref_name dtype: string - name: base_ref_oid dtype: string - name: comments_count dtype: int64 - name: reviews_count dtype: int64 - name: commits_count dtype: int64 - name: files list: - name: additions dtype: int64 - name: change_type dtype: string - name: deletions dtype: int64 - name: path dtype: string - name: assignees_count dtype: int64 - name: closing_issues_count dtype: int64 - name: author_association dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: active_lock_reason dtype: string splits: - name: train num_bytes: 27512018 num_examples: 20835 download_size: 10591468 dataset_size: 27512018 - config_name: PullRequests_Copilot features: - name: id dtype: string - name: title dtype: string - name: url dtype: string - name: number dtype: int64 - name: body dtype: string - name: state dtype: string - name: 
created_at dtype: string - name: is_draft dtype: bool - name: changed_files dtype: int64 - name: is_cross_repository dtype: bool - name: locked dtype: bool - name: is_in_merge_queue dtype: 'null' - name: additions dtype: int64 - name: deletions dtype: int64 - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: label_count dtype: int64 - name: base_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: head_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: timeline_count dtype: int64 - name: merged_at dtype: string - name: closed_at dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: review_decision dtype: string - name: head_ref_name dtype: string - name: head_ref_oid dtype: string - name: timeline_items dtype: 'null' - name: base_ref_name dtype: string - name: base_ref_oid dtype: string - name: comments_count dtype: int64 - name: reviews_count dtype: int64 - name: commits_count dtype: int64 - name: files list: - name: additions dtype: int64 - name: change_type dtype: string - name: deletions dtype: int64 - name: path dtype: string - name: assignees_count dtype: int64 - name: closing_issues_count dtype: int64 - name: author_association dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: active_lock_reason dtype: string splits: - name: train num_bytes: 82066535 num_examples: 18563 download_size: 35017107 dataset_size: 82066535 - config_name: PullRequests_Devin features: - name: id dtype: string - name: title dtype: string - name: url dtype: string - name: number dtype: int64 - name: body dtype: string - name: state dtype: string - name: created_at dtype: string - name: is_draft dtype: bool - name: changed_files dtype: int64 - 
name: is_cross_repository dtype: 'null' - name: locked dtype: bool - name: is_in_merge_queue dtype: 'null' - name: additions dtype: int64 - name: deletions dtype: int64 - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: label_count dtype: int64 - name: base_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: head_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: timeline_count dtype: int64 - name: merged_at dtype: string - name: closed_at dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: review_decision dtype: string - name: head_ref_name dtype: string - name: head_ref_oid dtype: string - name: timeline_items dtype: 'null' - name: base_ref_name dtype: string - name: base_ref_oid dtype: string - name: comments_count dtype: int64 - name: reviews_count dtype: int64 - name: commits_count dtype: int64 - name: files list: - name: additions dtype: int64 - name: change_type dtype: string - name: deletions dtype: int64 - name: path dtype: string - name: assignees_count dtype: int64 - name: closing_issues_count dtype: int64 - name: author_association dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: active_lock_reason dtype: string splits: - name: train num_bytes: 63944576 num_examples: 14045 download_size: 25737087 dataset_size: 63944576 - config_name: PullRequests_Human features: - name: id dtype: string - name: title dtype: string - name: url dtype: string - name: number dtype: int64 - name: body dtype: string - name: state dtype: string - name: created_at dtype: string - name: is_draft dtype: bool - name: changed_files dtype: int64 - name: is_cross_repository dtype: bool - name: locked dtype: bool - name: 
is_in_merge_queue dtype: 'null' - name: additions dtype: int64 - name: deletions dtype: int64 - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: label_count dtype: int64 - name: base_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: head_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: timeline_count dtype: int64 - name: merged_at dtype: string - name: closed_at dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: review_decision dtype: string - name: head_ref_name dtype: string - name: head_ref_oid dtype: string - name: timeline_items dtype: 'null' - name: base_ref_name dtype: string - name: base_ref_oid dtype: string - name: comments_count dtype: int64 - name: reviews_count dtype: int64 - name: commits_count dtype: int64 - name: files list: - name: additions dtype: int64 - name: change_type dtype: string - name: deletions dtype: int64 - name: path dtype: string - name: assignees_count dtype: int64 - name: closing_issues_count dtype: int64 - name: author_association dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: active_lock_reason dtype: string splits: - name: train num_bytes: 46594555 num_examples: 20910 download_size: 18790055 dataset_size: 46594555 - config_name: PullRequests_Jules features: - name: id dtype: string - name: title dtype: string - name: url dtype: string - name: number dtype: int64 - name: body dtype: string - name: state dtype: string - name: created_at dtype: string - name: is_draft dtype: bool - name: changed_files dtype: int64 - name: is_cross_repository dtype: 'null' - name: locked dtype: bool - name: is_in_merge_queue dtype: 'null' - name: additions dtype: int64 - name: deletions dtype: 
int64 - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: label_count dtype: int64 - name: base_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: head_repository struct: - name: id dtype: string - name: name dtype: string - name: url dtype: string - name: timeline_count dtype: int64 - name: merged_at dtype: string - name: closed_at dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: review_decision dtype: string - name: head_ref_name dtype: string - name: head_ref_oid dtype: string - name: timeline_items dtype: 'null' - name: base_ref_name dtype: string - name: base_ref_oid dtype: string - name: comments_count dtype: int64 - name: reviews_count dtype: int64 - name: commits_count dtype: int64 - name: files list: - name: additions dtype: int64 - name: change_type dtype: string - name: deletions dtype: int64 - name: path dtype: string - name: assignees_count dtype: int64 - name: closing_issues_count dtype: int64 - name: author_association dtype: string - name: labels list: - name: description dtype: string - name: name dtype: string - name: active_lock_reason dtype: string splits: - name: train num_bytes: 28962080 num_examples: 18468 download_size: 9758320 dataset_size: 28962080 - config_name: Repositories_Claude features: - name: id dtype: string - name: pr_id dtype: string - name: role dtype: string - name: name dtype: string - name: name_with_owner dtype: string - name: url dtype: string - name: ssh_url dtype: string - name: stargazer_count dtype: int64 - name: is_fork dtype: bool - name: is_archived dtype: bool - name: is_disabled dtype: 'null' - name: is_empty dtype: 'null' - name: is_in_organization dtype: bool - name: is_locked dtype: 'null' - name: is_private dtype: 'null' - name: is_mirror dtype: 'null' - name: 
is_template dtype: bool - name: is_user_configuration_repository dtype: bool - name: fork_count dtype: int64 - name: forking_allowed dtype: bool - name: created_at dtype: string - name: visibility dtype: string - name: owner struct: - name: id dtype: string - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: topics_count dtype: int64 - name: languages list: string - name: language_count dtype: int64 - name: watchers dtype: int64 - name: license_info dtype: string - name: default_brach dtype: string - name: repository_topics list: string - name: primary_language dtype: string - name: lock_reason dtype: 'null' - name: pushed_at dtype: string - name: updated_at dtype: string - name: archived_at dtype: string - name: description dtype: string splits: - name: train num_bytes: 20426354 num_examples: 38260 download_size: 2499442 dataset_size: 20426354 - config_name: Repositories_Codex features: - name: id dtype: string - name: pr_id dtype: string - name: role dtype: string - name: name dtype: string - name: name_with_owner dtype: string - name: url dtype: string - name: ssh_url dtype: string - name: stargazer_count dtype: int64 - name: is_fork dtype: bool - name: is_archived dtype: bool - name: is_disabled dtype: 'null' - name: is_empty dtype: 'null' - name: is_in_organization dtype: bool - name: is_locked dtype: 'null' - name: is_private dtype: 'null' - name: is_mirror dtype: 'null' - name: is_template dtype: bool - name: is_user_configuration_repository dtype: bool - name: fork_count dtype: int64 - name: forking_allowed dtype: bool - name: created_at dtype: string - name: visibility dtype: string - name: owner struct: - name: id dtype: string - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: topics_count dtype: int64 - name: languages list: string - name: language_count dtype: int64 - name: watchers dtype: int64 - name: license_info 
dtype: string - name: default_brach dtype: string - name: repository_topics list: string - name: primary_language dtype: string - name: lock_reason dtype: 'null' - name: pushed_at dtype: string - name: updated_at dtype: string - name: archived_at dtype: string - name: description dtype: string splits: - name: train num_bytes: 20955633 num_examples: 41669 download_size: 2882232 dataset_size: 20955633 - config_name: Repositories_Copilot features: - name: id dtype: string - name: pr_id dtype: string - name: role dtype: string - name: name dtype: string - name: name_with_owner dtype: string - name: url dtype: string - name: ssh_url dtype: string - name: stargazer_count dtype: int64 - name: is_fork dtype: bool - name: is_archived dtype: bool - name: is_disabled dtype: 'null' - name: is_empty dtype: 'null' - name: is_in_organization dtype: bool - name: is_locked dtype: 'null' - name: is_private dtype: 'null' - name: is_mirror dtype: 'null' - name: is_template dtype: bool - name: is_user_configuration_repository dtype: bool - name: fork_count dtype: int64 - name: forking_allowed dtype: bool - name: created_at dtype: string - name: visibility dtype: string - name: owner struct: - name: id dtype: string - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: topics_count dtype: int64 - name: languages list: string - name: language_count dtype: int64 - name: watchers dtype: int64 - name: license_info dtype: string - name: default_brach dtype: string - name: repository_topics list: string - name: primary_language dtype: string - name: lock_reason dtype: 'null' - name: pushed_at dtype: string - name: updated_at dtype: string - name: archived_at dtype: string - name: description dtype: string splits: - name: train num_bytes: 19802485 num_examples: 37125 download_size: 2756730 dataset_size: 19802485 - config_name: Repositories_Devin features: - name: id dtype: string - name: pr_id dtype: string - name: role dtype: 
string - name: name dtype: string - name: name_with_owner dtype: string - name: url dtype: string - name: ssh_url dtype: string - name: stargazer_count dtype: int64 - name: is_fork dtype: bool - name: is_archived dtype: bool - name: is_disabled dtype: 'null' - name: is_empty dtype: 'null' - name: is_in_organization dtype: bool - name: is_locked dtype: 'null' - name: is_private dtype: 'null' - name: is_mirror dtype: 'null' - name: is_template dtype: bool - name: is_user_configuration_repository dtype: bool - name: fork_count dtype: int64 - name: forking_allowed dtype: bool - name: created_at dtype: string - name: visibility dtype: string - name: owner struct: - name: id dtype: string - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: topics_count dtype: int64 - name: languages list: string - name: language_count dtype: int64 - name: watchers dtype: int64 - name: license_info dtype: string - name: default_brach dtype: string - name: repository_topics list: string - name: primary_language dtype: string - name: lock_reason dtype: 'null' - name: pushed_at dtype: string - name: updated_at dtype: string - name: archived_at dtype: string - name: description dtype: string splits: - name: train num_bytes: 14911226 num_examples: 28090 download_size: 1237029 dataset_size: 14911226 - config_name: Repositories_Human features: - name: id dtype: string - name: pr_id dtype: string - name: role dtype: string - name: name dtype: string - name: name_with_owner dtype: string - name: url dtype: string - name: ssh_url dtype: string - name: stargazer_count dtype: int64 - name: is_fork dtype: bool - name: is_archived dtype: bool - name: is_disabled dtype: 'null' - name: is_empty dtype: 'null' - name: is_in_organization dtype: bool - name: is_locked dtype: 'null' - name: is_private dtype: 'null' - name: is_mirror dtype: 'null' - name: is_template dtype: bool - name: is_user_configuration_repository dtype: bool - name: 
fork_count dtype: int64 - name: forking_allowed dtype: bool - name: created_at dtype: string - name: visibility dtype: string - name: owner struct: - name: id dtype: string - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: topics_count dtype: int64 - name: languages list: string - name: language_count dtype: int64 - name: watchers dtype: int64 - name: license_info dtype: string - name: default_brach dtype: string - name: repository_topics list: string - name: primary_language dtype: string - name: lock_reason dtype: 'null' - name: pushed_at dtype: string - name: updated_at dtype: string - name: archived_at dtype: string - name: description dtype: string splits: - name: train num_bytes: 22804151 num_examples: 41542 download_size: 5429443 dataset_size: 22804151 - config_name: Repositories_Jules features: - name: id dtype: string - name: pr_id dtype: string - name: role dtype: string - name: name dtype: string - name: name_with_owner dtype: string - name: url dtype: string - name: ssh_url dtype: string - name: stargazer_count dtype: int64 - name: is_fork dtype: bool - name: is_archived dtype: bool - name: is_disabled dtype: 'null' - name: is_empty dtype: 'null' - name: is_in_organization dtype: bool - name: is_locked dtype: 'null' - name: is_private dtype: 'null' - name: is_mirror dtype: 'null' - name: is_template dtype: bool - name: is_user_configuration_repository dtype: bool - name: fork_count dtype: int64 - name: forking_allowed dtype: bool - name: created_at dtype: string - name: visibility dtype: string - name: owner struct: - name: id dtype: string - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string - name: topics_count dtype: int64 - name: languages list: string - name: language_count dtype: int64 - name: watchers dtype: int64 - name: license_info dtype: string - name: default_brach dtype: string - name: repository_topics list: 
string - name: primary_language dtype: string - name: lock_reason dtype: 'null' - name: pushed_at dtype: string - name: updated_at dtype: string - name: archived_at dtype: string - name: description dtype: string splits: - name: train num_bytes: 18591214 num_examples: 36936 download_size: 1962081 dataset_size: 18591214 - config_name: Reviews_Claude features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: state dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: submitted_at dtype: string - name: minimized_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 46894755 num_examples: 12728 download_size: 17078595 dataset_size: 46894755 - config_name: Reviews_Codex features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: state dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: submitted_at dtype: string - name: minimized_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 3656182 num_examples: 1957 download_size: 1271022 dataset_size: 3656182 - config_name: Reviews_Copilot features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: state dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: 
string - name: published_at dtype: string - name: submitted_at dtype: string - name: minimized_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 13185210 num_examples: 20665 download_size: 4235301 dataset_size: 13185210 - config_name: Reviews_Devin features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: state dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: submitted_at dtype: string - name: minimized_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 5296907 num_examples: 6901 download_size: 1598524 dataset_size: 5296907 - config_name: Reviews_Human features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: state dtype: string - name: updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: submitted_at dtype: string - name: minimized_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 17970167 num_examples: 21401 download_size: 6705393 dataset_size: 17970167 - config_name: Reviews_Jules features: - name: id dtype: string - name: pr_id dtype: string - name: url dtype: string - name: body dtype: string - name: created_at dtype: string - name: is_minimized dtype: bool - name: state dtype: string - name: 
updated_at dtype: string - name: last_edited_at dtype: string - name: published_at dtype: string - name: submitted_at dtype: string - name: minimized_reason dtype: string - name: author struct: - name: id dtype: 'null' - name: login dtype: string - name: name dtype: 'null' - name: typename dtype: string - name: url dtype: string splits: - name: train num_bytes: 2724226 num_examples: 3249 download_size: 1035607 dataset_size: 2724226 configs: - config_name: Comments_Claude data_files: - split: train path: data/Claude/Comments/train-* - config_name: Comments_Codex data_files: - split: train path: data/Codex/Comments/train-* - config_name: Comments_Copilot data_files: - split: train path: data/Copilot/Comments/train-* - config_name: Comments_Devin data_files: - split: train path: data/Devin/Comments/train-* - config_name: Comments_Human data_files: - split: train path: data/Human/Comments/train-* - config_name: Comments_Jules data_files: - split: train path: data/Jules/Comments/train-* - config_name: Commits_Claude data_files: - split: train path: data/Claude/Commits/train-* - config_name: Commits_Codex data_files: - split: train path: data/Codex/Commits/train-* - config_name: Commits_Copilot data_files: - split: train path: data/Copilot/Commits/train-* - config_name: Commits_Devin data_files: - split: train path: data/Devin/Commits/train-* - config_name: Commits_Human data_files: - split: train path: data/Human/Commits/train-* - config_name: Commits_Jules data_files: - split: train path: data/Jules/Commits/train-* - config_name: Issues_Claude data_files: - split: train path: data/Claude/Issues/train-* - config_name: Issues_Codex data_files: - split: train path: data/Codex/Issues/train-* - config_name: Issues_Copilot data_files: - split: train path: data/Copilot/Issues/train-* - config_name: Issues_Devin data_files: - split: train path: data/Devin/Issues/train-* - config_name: Issues_Human data_files: - split: train path: data/Human/Issues/train-* - config_name: 
Issues_Jules data_files: - split: train path: data/Jules/Issues/train-* - config_name: PullRequests_Claude data_files: - split: train path: data/Claude/PullRequests/train-* - config_name: PullRequests_Codex data_files: - split: train path: data/Codex/PullRequests/train-* - config_name: PullRequests_Copilot data_files: - split: train path: data/Copilot/PullRequests/train-* - config_name: PullRequests_Devin data_files: - split: train path: data/Devin/PullRequests/train-* - config_name: PullRequests_Human data_files: - split: train path: data/Human/PullRequests/train-* - config_name: PullRequests_Jules data_files: - split: train path: data/Jules/PullRequests/train-* - config_name: Repositories_Claude data_files: - split: train path: data/Claude/Repositories/train-* - config_name: Repositories_Codex data_files: - split: train path: data/Codex/Repositories/train-* - config_name: Repositories_Copilot data_files: - split: train path: data/Copilot/Repositories/train-* - config_name: Repositories_Devin data_files: - split: train path: data/Devin/Repositories/train-* - config_name: Repositories_Human data_files: - split: train path: data/Human/Repositories/train-* - config_name: Repositories_Jules data_files: - split: train path: data/Jules/Repositories/train-* - config_name: Reviews_Claude data_files: - split: train path: data/Claude/Reviews/train-* - config_name: Reviews_Codex data_files: - split: train path: data/Codex/Reviews/train-* - config_name: Reviews_Copilot data_files: - split: train path: data/Copilot/Reviews/train-* - config_name: Reviews_Devin data_files: - split: train path: data/Devin/Reviews/train-* - config_name: Reviews_Human data_files: - split: train path: data/Human/Reviews/train-* - config_name: Reviews_Jules data_files: - split: train path: data/Jules/Reviews/train-*
---

# Agent Activity Dataset

This dataset is released in conjunction with the paper **Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time**, accepted at [**MSR 2026**](https://2026.msrconf.org/details/msr-2026-technical-papers/21/Investigating-Autonomous-Agent-Contributions-in-the-Wild-Activity-Patterns-and-Code-).

## Dataset Overview

The dataset contains a total of **111,969** Pull Requests **(June through August 2025)** from both coding agents (**Claude Code**, **OpenAI Codex**, **GitHub Copilot**, **Google Jules**, and **Devin**) and **human** contributors. It also includes additional activity metadata such as repositories, commits, comments, issues, reviews, and modified files. A summary of the dataset is presented below.

| PR Author      | #PR    | #Repository | #Commit | #Comment | #Review | #Issue | #Changed File |
|----------------|--------|-------------|---------|----------|---------|--------|---------------|
| OpenAI Codex   | 20,835 | 41,669      | 27,530  | 3,693    | 1,957   | 45     | 90,822        |
| Claude Code    | 19,148 | 38,260      | 82,755  | 22,329   | 12,728  | 4,052  | 255,275       |
| GitHub Copilot | 18,563 | 37,125      | 69,896  | 26,664   | 20,665  | 9,744  | 158,404       |
| Google Jules   | 18,468 | 36,936      | 41,032  | 5,700    | 3,249   | 2,185  | 138,610       |
| Devin          | 14,045 | 28,090      | 51,641  | 27,518   | 6,901   | 294    | 131,454       |
| Human          | 20,910 | 41,542      | 102,037 | 18,559   | 21,401  | 1,973  | 194,861       |

## Dataset Structure

The schema of the dataset is shown below. Solid lines indicate entities, while dotted lines represent nested objects.

![Dataset Overview](https://cdn-uploads.huggingface.co/production/uploads/68ffec465d1b138dc097e213/pZKV46gsmPsJzGc0gVZQP.png)

- **Pull Request**: records the content, state, and activity of a pull request, including author, repository references, timestamps, and total number of commits, reviews, comments, closed issues, labels, and files changed.
- **Repository**: stores a repository's ownership, visibility, status flags, popularity metrics, programming languages, topics, licensing, timestamps, and descriptive information.
- **Commit**: captures a commit's identity, content, timestamps, authoring and committing information, changed files, and associated authors for a given pull request.
- **Review**: describes a pull request review, including its identifier, author, content, state, timestamps, and minimization status.
- **Comment**: represents a pull request comment with its identifier, author, content, timestamps, publication status, and minimization details.
- **Issue**: stores information about an issue linked to a pull request, including its identifier, author, title, description, state, timestamps, type, labels, and other associated PRs.

## Dataset Usage

The example below loads each configuration (*pull requests*, *repositories*, *commits*, *comments*, *reviews*, and *issues*) for *Claude*. The same pattern applies to the other PR authors: replace the `Claude` suffix in the configuration name with *Codex*, *Copilot*, *Devin*, *Jules*, or *Human*.

```python
from datasets import load_dataset

# Each configuration is named <Activity>_<Author> and has a single 'train' split.
claude_pullrequests = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'PullRequests_Claude', split='train')
claude_repositories = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Repositories_Claude', split='train')
claude_commits = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Commits_Claude', split='train')
claude_comments = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Comments_Claude', split='train')
claude_reviews = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Reviews_Claude', split='train')
claude_issues = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', 'Issues_Claude', split='train')
```

The example below loads the same data by data directory for **Claude**. The same pattern applies to the other PR authors: replace `Claude` in the directory path with *Codex*, *Copilot*, *Devin*, *Jules*, or *Human*.
```python
from datasets import load_dataset

# Data directories follow the layout data/<Author>/<Activity>.
claude_pullrequests = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/PullRequests', split='train')
claude_repositories = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Repositories', split='train')
claude_commits = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Commits', split='train')
claude_comments = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Comments', split='train')
claude_reviews = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Reviews', split='train')
claude_issues = load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', data_dir='data/Claude/Issues', split='train')
```
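The naming pattern used in both loading styles can be generated rather than typed out by hand. The following is a minimal sketch, assuming only the `<Activity>_<Author>` configuration names and the `data/<Author>/<Activity>` directory layout listed in the dataset card; the helper names `config_name`, `data_dir`, and `ALL_CONFIGS` are illustrative and not part of the dataset or of the `datasets` library.

```python
from itertools import product

# Activity types and author suffixes as they appear in the card's configuration list.
ACTIVITIES = ["PullRequests", "Repositories", "Commits", "Comments", "Reviews", "Issues"]
AUTHORS = ["Claude", "Codex", "Copilot", "Devin", "Jules", "Human"]


def _check(activity: str, author: str) -> None:
    """Reject names that do not occur in the dataset card."""
    if activity not in ACTIVITIES:
        raise ValueError(f"unknown activity: {activity!r}")
    if author not in AUTHORS:
        raise ValueError(f"unknown author: {author!r}")


def config_name(activity: str, author: str) -> str:
    """Build a configuration name such as 'Reviews_Jules'."""
    _check(activity, author)
    return f"{activity}_{author}"


def data_dir(activity: str, author: str) -> str:
    """Build the matching data directory such as 'data/Jules/Reviews'."""
    _check(activity, author)
    return f"data/{author}/{activity}"


# Every activity/author combination listed in the card.
ALL_CONFIGS = [config_name(activity, author) for activity, author in product(ACTIVITIES, AUTHORS)]

if __name__ == "__main__":
    print(len(ALL_CONFIGS), "configurations")
    print(config_name("Reviews", "Jules"), data_dir("Reviews", "Jules"))
```

A generated name can then be passed as the second argument of `load_dataset`, for example `load_dataset('AISE-TUDelft/MOSAIC-agentic-3m', config_name('Reviews', 'Jules'), split='train')`.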
Provider: AISE-TUDelft
Collected Summary
Dataset Introduction
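Construction
At the intersection of software engineering and artificial intelligence, the MOSAIC-agentic-3m dataset was built by systematically collecting collaboration data from GitHub. Structured information was extracted from the Pull Requests, Issues, Commits, and Comments associated with several well-known AI agents (Claude, Codex, Copilot, Devin, and Jules) and with human developers. Collection relies on version-control metadata, so the origin of every record is traceable, and a unified schema standardizes the different record types, yielding a comprehensive corpus that covers many dimensions of software development activity.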
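Features
The dataset's most distinctive feature is its fine-grained, multi-faceted structure: it separates human contributions from those of several AI agents and also categorizes records by stage of the development process (such as commits, issue tracking, and code review). Each configuration carries rich metadata fields, for example author information, timestamps, change statistics, and linking identifiers, which give detailed context for analyzing behavior patterns in AI-assisted programming. The differences in data volume across configurations reflect how heavily each agent participates in real projects, and this natural distribution makes the dataset more realistic and representative.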
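The linking identifiers make it possible to relate activity records back to their pull request. The following is a minimal sketch, assuming rows shaped like the `Comments_*` schema in the dataset card (string fields `id`, `pr_id`, `body`, and an `author` struct with a `login` field). The sample rows are invented for illustration and are not taken from the dataset; the function names are hypothetical.

```python
from collections import Counter


def comments_per_pr(rows):
    """Count how many comment rows reference each pull request via `pr_id`."""
    return Counter(row["pr_id"] for row in rows)


def top_commenters(rows, n=3):
    """Return the n most frequent comment authors by login, skipping rows without an author."""
    logins = (row["author"]["login"] for row in rows if row.get("author"))
    return Counter(logins).most_common(n)


# Hypothetical rows that mimic the Comments_* schema; not real dataset content.
SAMPLE_ROWS = [
    {"id": "c1", "pr_id": "pr-1", "body": "Looks good.", "author": {"login": "alice"}},
    {"id": "c2", "pr_id": "pr-1", "body": "Please add a test.", "author": {"login": "bob"}},
    {"id": "c3", "pr_id": "pr-2", "body": "Rebased.", "author": {"login": "alice"}},
    {"id": "c4", "pr_id": "pr-2", "body": "Automated note.", "author": None},
]

if __name__ == "__main__":
    print(comments_per_pr(SAMPLE_ROWS))
    print(top_commenters(SAMPLE_ROWS, n=1))
```

Both functions accept any iterable of dictionaries, so a split returned by `load_dataset` can be passed in place of `SAMPLE_ROWS`.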
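Usage
Researchers can use the dataset for a range of empirical analyses, for example comparing the linguistic features and behavior patterns of humans and AI agents in commit messages, issue discussions, or review comments, to explore the effect of AI on collaborative development. The dataset is loaded one configuration at a time, which makes it easy to focus a study on a particular agent or activity type. Typical applications include training or evaluating code generation models, studying human-AI collaboration dynamics, and serving as benchmark data for detecting the characteristics of AI-generated content. Use must comply with the GPL-3.0 license, and analyses should account for the differing sample sizes of the subsets so that results remain robust.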
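Because the subsets differ in size, raw counts are easier to compare once they are divided by the number of pull requests. The following is a small worked example using the #PR, #Comment, and #Review figures from the summary table in the dataset card; the dictionary and function names are illustrative only.

```python
# (#PR, #Comment, #Review) per PR author, copied from the dataset card's summary table.
SUMMARY = {
    "OpenAI Codex": (20835, 3693, 1957),
    "Claude Code": (19148, 22329, 12728),
    "GitHub Copilot": (18563, 26664, 20665),
    "Google Jules": (18468, 5700, 3249),
    "Devin": (14045, 27518, 6901),
    "Human": (20910, 18559, 21401),
}


def per_pr_rates(summary):
    """Normalize comment and review counts by the number of pull requests."""
    return {
        author: {
            "comments_per_pr": comments / prs,
            "reviews_per_pr": reviews / prs,
        }
        for author, (prs, comments, reviews) in summary.items()
    }


RATES = per_pr_rates(SUMMARY)

if __name__ == "__main__":
    # List authors from most to fewest comments per pull request.
    ordered = sorted(RATES.items(), key=lambda item: item[1]["comments_per_pr"], reverse=True)
    for author, rate in ordered:
        print(f"{author:15s} {rate['comments_per_pr']:.2f} comments/PR  {rate['reviews_per_pr']:.2f} reviews/PR")
```

These rates describe volume only; they say nothing about the content or quality of the comments and reviews.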
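Background and Challenges
Background Overview
Against the backdrop of rapid progress in AI agent technology, the MOSAIC-agentic-3m dataset was created to provide large-scale, fine-grained, real-world interaction data for research on agent collaboration and code generation. Recently built by the associated research group, it focuses on the collaborative behavior patterns of multiple agents in the software development process; its core research questions concern the quantitative assessment of inter-agent communication efficiency, task-allocation strategies, and the quality of code contributions. By integrating pull requests, commits, issue reports, and comments from Claude, Codex, Copilot, Devin, Jules, and human developers, the dataset lays an empirical foundation for understanding agent autonomy and adaptability in complex engineering environments, and it is significant for advancing intelligent software engineering and multi-agent systems research.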
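Current Challenges
The dataset targets the challenges of behavior identification and performance evaluation for agents in collaborative software development: distinguishing the stylistic features of content generated by different agents, quantifying the quality and novelty of agents' code contributions, and modeling the dynamic complexity of multi-agent interaction. Building it posed serious data collection and cleaning challenges, such as handling privacy and licensing issues when extracting large volumes of heterogeneous data from platforms such as GitHub, ensuring accurate and consistent annotation, and overcoming differences in agents' output formats and the difficulty of aligning timestamps, all in order to produce a high-quality, reusable benchmark dataset.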
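Timestamp alignment is easier when every timestamp is first converted to a timezone-aware UTC value. The dataset card stores fields such as `created_at` as strings; the following is a minimal sketch that assumes those strings are ISO 8601 (for example `2025-06-15T10:30:00Z`), which the card does not state explicitly. The June through August 2025 window is the pull-request collection period given in the card; the function names are hypothetical.

```python
from datetime import datetime, timezone

# Pull-request collection window stated in the dataset card (June through August 2025).
WINDOW_START = datetime(2025, 6, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 9, 1, tzinfo=timezone.utc)  # exclusive upper bound


def parse_timestamp(value):
    """Parse an ISO 8601 string into an aware UTC datetime; return None for empty values.

    Assumption: timestamps are ISO 8601, optionally ending in 'Z'.
    """
    if not value:
        return None
    # Replace a trailing 'Z' so that older Python versions can parse the string.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def in_collection_window(value):
    """Check whether a timestamp string falls inside the collection window."""
    parsed = parse_timestamp(value)
    return parsed is not None and WINDOW_START <= parsed < WINDOW_END


if __name__ == "__main__":
    for sample in ["2025-06-15T10:30:00Z", "2025-09-01T00:00:00Z", None]:
        print(sample, in_collection_window(sample))
```

If the real strings turn out to use a different format, only `parse_timestamp` needs to change.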
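Common Scenarios
Classic Use Cases
At the intersection of software engineering and artificial intelligence, the MOSAIC-agentic-3m dataset is a key resource for studying agent-driven code collaboration. It is typically used to train and evaluate large language models on tasks such as code review, commit message generation, and issue tracking. By integrating multi-source data from Claude, Codex, Copilot, Devin, Jules, and human developers, it lets researchers analyze in depth how agents and humans differ in their interaction patterns across the software development process, providing a data foundation for building more efficient automated programming assistants.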
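Academic Problems Addressed
The dataset addresses core academic problems such as the interpretability of agent behavior, the evaluation of code generation quality, and the quantification of human-AI collaboration efficiency. By providing structured, clearly labeled records of repository activity, it allows researchers to study systematically the decision logic and output characteristics of agents in real development environments. Its significance lies in moving beyond the limits of earlier simulated environments and small-scale datasets: it offers an empirical basis for verifying how well agents actually perform in complex software engineering scenarios, and it helps shift automated programming research from theoretical exploration toward practical validation.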
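Derived Related Work
Several lines of research have grown out of this dataset, for example code quality evaluation frameworks based on comparing the behavior of multiple agents, development activity prediction models that incorporate temporal information, and methods for verifying the trustworthiness of agent-generated content. This work has deepened the understanding of how agents behave when coding and has led to tools such as automated code review assistants and commit message generators. Related research extends further into software engineering education and open-source community governance, forming a data-driven methodology for intelligent software development.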
The content above was collected and summarized by 遇见数据集.