Download link:

https://modelscope.cn/datasets/JetBrains/git_good_bench-train

# Dataset Summary

GitGoodBench Lite is a subset of 17469 samples for collecting trajectories of AI agents resolving git tasks (see Supported Tasks) for model training purposes. We support the programming languages Python, Java and Kotlin and the sample types merge conflict resolution and file-commit chain. All data in this dataset are collected from 816 unique, open-source GitHub repositories with permissive licenses that have >= 1000 stars, >= 5 branches, >= 10 contributors and are neither forks nor archived. We collected the initial list of repositories using [SEART](https://seart-ghs.si.usi.ch/). For further details, see our paper (listed under Cite Us).

# Supported Tasks

GitGoodBench Lite contains two types of samples: 'merge' and 'file_commit_chain'. Note that the sample type 'file_commit_chain' can be used for two scenario types: performing an interactive rebase to clean up the local tree, or iteratively generating commits based on the staged, uncommitted changes.

## Merge

Merge scenarios contain one or more merge conflicts that occurred during a merge. All merge conflicts are guaranteed to be in a Python, Java or Kotlin file. The dataset only contains merges with exactly two parents (no octopus merges). A merge scenario looks as follows:

```
{
    'merge_commit_hash': '9bcf252fb11ec692dfbc152933dddd427098dcc9',
    'parents': [
        '5d5df76aa7df56bdbec07c18e063a1125cfd0465',
        '3bf663778b2a56c614818069043354d4b6d5f156'
    ],
    'number_of_files_with_merge_conflict': 1,
    'total_number_of_merge_conflicts': 2,
    'files_in_merge_conflict': ['models/index_model.py']
}
```

Here, `merge_commit_hash` is the ground-truth merge commit, and `parents` are the two commits whose merge produced the conflict(s) in `files_in_merge_conflict`.

## File-Commit Chain

File-commit chain scenarios are defined by two commits, the oldest and the newest. `file` was modified in every commit between `oldest_commit` and `newest_commit` (inclusive). In total, the chain consists of `times_seen_consecutively` commits. These scenarios are intended to evaluate an agent's capacity to create meaningful, cohesive commits or to improve the local tree via rebasing, so samples of this `sample_type` cover two scenario types. File-commit chains are at least 3 commits long, the file the sample concerns itself with is guaranteed to be of `programming_language` (this does not hold for other files touched by the commits of the sample), and no commit is a merge commit. A `file_commit_chain` scenario looks as follows:

```
{
    'file': 'torchaudio/transforms/_transforms.py',
    'branch': 'main',
    'times_seen_consecutively': 3,
    'purity': 0.69,
    'newest_commit': '7ac3e2e237e443baf91dfbf9893fca114c1c6001',
    'oldest_commit': '3742cebb7dc0f8adf24f4ee1cea368195c448f78'
}
```

`purity` indicates the relative amount of changes in the chain that occurred solely in `file` and is a heuristic for the difficulty of the scenario. We expect noisier scenarios (lower `purity`) to be more difficult.

# Dataset Structure

The following table provides per-field details. Fields marked "Yes" under **Is Metadata?** provide contextual or descriptive information but are not essential to the primary scenario logic.

| **Field** | **Type** | **Description** | **Is Metadata?** |
|--------------------------|---------|-----------------|------------------|
| **id** | string | A unique identifier for the dataset entry: `<name>-<sample_type>-<running_index>` | No |
| **name** | string | The repository name, in "owner/repository" format. | No |
| **default_branch** | string | The primary or default branch for the repository. | No |
| **license** | string | Repository license. | Yes |
| **stargazers** | integer | The number of stars on GitHub. | Yes |
| **created_at** | string | The repository creation date. | Yes |
| **topics** | string | A semicolon-delimited list of topics or tags associated with the repository. | Yes |
| **programming_language** | string | The programming language of the sample. Possible values: "java", "python", or "kotlin". | No |
| **scenario** | string | A JSON string describing specific scenario data (e.g., merge-conflict details, parent commits). | No |
| **sample_type** | string | The type of sample. Possible values: "merge" or "file_commit_chain". | No |
| **project_size** | string | Estimated size based on lines of code. Possible values: "tiny", "small", "medium", "large", or "huge". | Yes |
| **difficulty** | string | The complexity level of the scenario. Possible values: "easy", "medium", or "hard". | Yes |

**Note**:

- Fields marked as **Is Metadata? = Yes** provide contextual information (e.g., project stats, licensing) rather than forming the core logic of a scenario.
- Fields marked **No** represent the primary data for the scenario. Use them to inform or categorize the scenario type and project details.

# Dataset Statistics

We provide some statistics on the diversity of our dataset with respect to repositories, programming languages and merge conflict resolution samples.

## Dataset Skew

The statistics below show that our dataset is not extremely skewed towards individual repositories and is relatively well balanced with respect to source repositories.
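The figures reported in the rest of this section can be cross-checked with plain arithmetic. The following is only a sanity-check sketch: every number in it is copied from this card (17469 samples, 816 repositories, the top-10 repository percentages and the per-language counts), while the constant and function names are illustrative and not part of the dataset or any tooling.

```python
# Sanity-check sketch: all figures are copied from this dataset card,
# nothing is read from the dataset itself. Names are illustrative only.

TOTAL_SAMPLES = 17469
TOTAL_REPOSITORIES = 816

# Percentage of total samples for the ten largest repositories.
TOP10_SHARE_PERCENT = [3.69, 2.47, 2.46, 2.16, 1.96, 1.94, 1.76, 1.32, 1.30, 1.05]

# Sample counts per programming language.
LANGUAGE_COUNTS = {"python": 10985, "java": 5881, "kotlin": 603}


def mean_samples_per_repository(total_samples, total_repositories):
    """Average number of samples contributed by one repository."""
    return total_samples / total_repositories


def cumulative_share(percentages):
    """Combined share (in percent) of the listed repositories."""
    return sum(percentages)


def language_shares(counts):
    """Share of each language in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}


if __name__ == "__main__":
    print("mean samples per repository:",
          round(mean_samples_per_repository(TOTAL_SAMPLES, TOTAL_REPOSITORIES), 1))
    print("top-10 repositories, combined share (%):",
          round(cumulative_share(TOP10_SHARE_PERCENT), 2))
    print("language shares (%):", language_shares(LANGUAGE_COUNTS))
```

Under these inputs the ten largest repositories together account for roughly a fifth of all samples, which is one way to read the claim that no single repository dominates.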
### Distribution Statistics

- Total number of repositories analyzed: 816
- Average (mean) samples per repository: 21.4
- Standard deviation (std): 48.8
- Minimum (min): 1
- 25th percentile (25%): 2
- Median (50%): 6
- 75th percentile (75%): 18
- Maximum (max): 644

### Top-10 Repositories by Sample Count

| Repository | Percentage of Total Samples |
|------------------------------------------|----------------------------:|
| zulip/zulip | 3.69% |
| trinodb/trino | 2.47% |
| wandb/wandb | 2.46% |
| facebook/litho | 2.16% |
| oss-review-toolkit/ort | 1.96% |
| apache/tomcat | 1.94% |
| nvidia/nemo | 1.76% |
| h2oai/h2ogpt | 1.32% |
| conan-io/conan | 1.30% |
| huggingface/transformers | 1.05% |

### Distribution of Programming Languages

We do, however, note a severe skew towards Python and Java, with only 3.45% of samples being Kotlin.

| Programming Language | Count | Percentage |
|----------------------|--------:|-----------:|
| python | 10985 | 62.88% |
| java | 5881 | 33.67% |
| kotlin | 603 | 3.45% |

## Difficulty Distribution for "merge" Scenarios

| Difficulty | Proportion |
|------------|-----------:|
| easy | 0.516466 |
| hard | 0.299672 |
| medium | 0.183861 |

**Languages**

The text data in this dataset consists mostly of commit messages and comments and is primarily in English. We do not, however, explicitly filter for any human language.
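To make the `scenario` and `sample_type` fields from the Dataset Structure section more concrete, here is a minimal sketch of decoding a row and deriving git commands from it. It rests on several assumptions that this card does not confirm: that `scenario` is stored as valid JSON (the examples in this card are printed with single quotes), that `parents` follows git's parent order so the first entry is the commit to check out, and that the history between `oldest_commit` and `newest_commit` is linear. The helper names and the two example rows are hypothetical; only the hashes and paths come from the examples in this card.

```python
import json

# Hypothetical rows assembled from the scenario examples in this card.
MERGE_ROW = {
    "sample_type": "merge",
    "scenario": json.dumps({
        "merge_commit_hash": "9bcf252fb11ec692dfbc152933dddd427098dcc9",
        "parents": [
            "5d5df76aa7df56bdbec07c18e063a1125cfd0465",
            "3bf663778b2a56c614818069043354d4b6d5f156",
        ],
        "files_in_merge_conflict": ["models/index_model.py"],
    }),
}
CHAIN_ROW = {
    "sample_type": "file_commit_chain",
    "scenario": json.dumps({
        "file": "torchaudio/transforms/_transforms.py",
        "newest_commit": "7ac3e2e237e443baf91dfbf9893fca114c1c6001",
        "oldest_commit": "3742cebb7dc0f8adf24f4ee1cea368195c448f78",
    }),
}


def decode_scenario(row):
    """Parse the JSON string stored in the `scenario` field."""
    return json.loads(row["scenario"])


def merge_replay_commands(scenario):
    """Commands that would re-run the merge inside a clone of the repository.

    Assumes parents[0] is the branch that was checked out at merge time.
    """
    first_parent, second_parent = scenario["parents"]
    return [
        ["git", "checkout", first_parent],
        ["git", "merge", "--no-commit", second_parent],
    ]


def chain_listing_command(scenario):
    """Command listing commits from oldest to newest (inclusive) touching `file`."""
    commit_range = f"{scenario['oldest_commit']}^..{scenario['newest_commit']}"
    return ["git", "log", "--format=%H", commit_range, "--", scenario["file"]]


def commands_for(row):
    """Dispatch on `sample_type` and return a list of git command lines."""
    scenario = decode_scenario(row)
    if row["sample_type"] == "merge":
        return merge_replay_commands(scenario)
    if row["sample_type"] == "file_commit_chain":
        return [chain_listing_command(scenario)]
    raise ValueError(f"unknown sample_type: {row['sample_type']!r}")


if __name__ == "__main__":
    for example in (MERGE_ROW, CHAIN_ROW):
        for command in commands_for(example):
            print(" ".join(command))
```

The sketch only builds command lists and never runs git, so it can be tried without cloning anything; each list could be passed to `subprocess.run` inside a checkout of the repository named in `name`.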
# Cite Us

```bibtex
@inproceedings{lindenbauer-etal-2025-gitgoodbench,
    title = "{G}it{G}ood{B}ench: A Novel Benchmark For Evaluating Agentic Performance On Git",
    author = "Lindenbauer, Tobias and Bogomolov, Egor and Zharov, Yaroslav",
    editor = "Kamalloo, Ehsan and Gontier, Nicolas and Lu, Xing Han and Dziri, Nouha and Murty, Shikhar and Lacoste, Alexandre",
    booktitle = "Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.realm-1.19/",
    doi = "10.18653/v1/2025.realm-1.19",
    pages = "272--288",
    ISBN = "979-8-89176-264-0",
    abstract = "Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11{\%} solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming."
}
```

