Download link:

https://modelscope.cn/datasets/JetBrains/git_good_bench-lite



Resource description:

# Dataset Summary

GitGoodBench Lite is a subset of 120 samples for evaluating the performance of AI agents in resolving git tasks (see Supported Tasks). The samples are evenly split across the programming languages Python, Java and Kotlin and the sample types merge conflict resolution and file-commit chain. The dataset thus contains 20 samples per combination of sample type and programming language. All data are collected from 100 unique, open-source GitHub repositories with permissive licenses that have >= 1000 stars, >= 5 branches and >= 10 contributors, and that are neither forks nor archived. We collected the initial list of repositories using [SEART](https://seart-ghs.si.usi.ch/). Evaluation is to be performed by exact match (EM) of diffs for the merge conflict setting and by LLM-as-a-Judge for the file-commit chain setting. For further details, see our paper (cited under Cite Us).

# Supported Tasks

GitGoodBench Lite contains two types of samples: 'merge' and 'file_commit_chain'. Note that the sample type 'file_commit_chain' can be used for two scenario types: performing an interactive rebase to clean up the local tree, or iteratively generating commits based on the staged, uncommitted changes.

## Merge

Merge scenarios contain one or more merge conflicts that occurred during a merge. All merge conflicts are guaranteed to be in a Python, Java or Kotlin file. Our dataset only contains merges with exactly two parents (no octopus merges).

A merge scenario looks as follows:

```
{
  'merge_commit_hash': 'baa37f65fdff5b780a50d5b5c6bf8bc3ade43815',
  'parents': ['d758810c59a9134f437d60f73a82036749688ccb', '5dcd493c67ff863c69c1214f0892a80e4951087e'],
  'number_of_files_with_merge_conflict': 2,
  'total_number_of_merge_conflicts': 2,
  'files_in_merge_conflict': ['cogs/gpt_3_commands_and_converser.py', 'models/openai_model.py']
}
```

Here, `merge_commit_hash` contains the ground truth merge commit, and `parents` are the two commits whose merge produced the conflict(s) in `files_in_merge_conflict`.

## File-Commit Chain

A file-commit chain scenario is delimited by two commits, `oldest_commit` and `newest_commit`. In all commits between `oldest_commit` and `newest_commit` (inclusive), `file` was modified. In total, the chain consists of `times_seen_consecutively` commits. These scenarios are intended to evaluate the agent's capacity to create meaningful, cohesive commits or to improve the local tree via rebasing. Thus, samples of this `sample_type` cover two scenario types. File-commit chains are at least 3 commits long, the file the sample concerns is guaranteed to be of `programming_language` (this is not guaranteed for other files touched by the commits of the sample), and no commit is a merge commit.

A `file_commit_chain` scenario looks as follows:

```
{
  'file': 'composer/models/huggingface.py',
  'branch': 'origin/vincent-mlflow-logger-verbose',
  'times_seen_consecutively': 3,
  'purity': 0.68,
  'newest_commit': 'c24b29f19c4c131a3ea7098dd8b8a5edde344819',
  'oldest_commit': 'c1ff80900f46d4e36feb4b326689fe14fc41cbc6'
}
```

`purity` indicates the relative amount of changes in the chain that occurred solely in `file` and is a heuristic for the difficulty of the scenario. We expect noisier scenarios (lower `purity`) to be more difficult.

# Dataset Structure

The following table provides per-field details.
Fields marked “Yes” under **Is Metadata?** provide contextual or descriptive information but are not essential to the primary scenario logic.

| **Field** | **Type** | **Description** | **Is Metadata?** |
|--------------------------|---------|-----------------|------------------|
| **id** | string | A unique identifier for the dataset entry: `<name>-<sample_type>-<running_index>` | No |
| **name** | string | The repository name, in “owner/repository” format. | No |
| **default_branch** | string | The primary or default branch for the repository. | No |
| **license** | string | Repository license. | Yes |
| **stargazers** | integer | The number of stars on GitHub. | Yes |
| **created_at** | string | The repository creation date. | Yes |
| **topics** | string | A semicolon-delimited list of topics or tags associated with the repository. | Yes |
| **programming_language** | string | The programming language of the sample. Possible values: `java`, `python`, or `kotlin`. | No |
| **scenario** | string | A JSON string describing specific scenario data (e.g., merge-conflict details, parent commits). | No |
| **sample_type** | string | The type of sample. Possible values: `merge` or `file_commit_chain`. | No |
| **project_size** | string | Estimated size based on lines of code. Possible values: `tiny`, `small`, `medium`, `large`, or `huge`. | Yes |
| **difficulty** | string | The complexity level of the scenario. Possible values: `easy`, `medium`, or `hard`. | Yes |

**Note**:

- Fields marked as **Is Metadata? = Yes** provide contextual information (e.g., project stats, licensing) rather than forming the core logic of a scenario.
- Fields marked **No** represent the primary data for the scenario. Use them to inform or categorize the scenario type and project details.

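As a rough illustration of how the `sample_type` and `scenario` fields fit together, the sketch below decodes a `scenario` string and derives git commands for inspecting the sample in a local clone. It rests on several assumptions that this card does not confirm: the `scenario` strings are built by hand from the example values shown above, decoding falls back to `ast.literal_eval` because the examples in this card are printed with single quotes, the `parents` list is taken to be in git's parent order, and the helper names `parse_scenario` and `inspection_commands` are hypothetical.

```python
import ast
import json


def parse_scenario(raw: str) -> dict:
    """Decode the `scenario` field of a row into a dict."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: tolerate strings printed as Python dict literals
        # (single quotes), as in the examples shown in this card.
        return ast.literal_eval(raw)


def inspection_commands(sample_type: str, scenario: dict) -> list:
    """Return git commands (as strings) for inspecting a sample locally."""
    if sample_type == "merge":
        # Assumption: parents[0] is the branch that was merged into.
        first, second = scenario["parents"]
        return [
            f"git checkout {first}",
            f"git merge --no-commit {second}",  # stops on the conflict(s)
        ]
    if sample_type == "file_commit_chain":
        oldest = scenario["oldest_commit"]
        newest = scenario["newest_commit"]
        # List the commits from oldest to newest (inclusive) touching `file`.
        return [f"git log --oneline {oldest}^..{newest} -- {scenario['file']}"]
    raise ValueError(f"unknown sample_type: {sample_type!r}")


# Scenario strings built from the example values in this card.
merge_scenario = json.dumps({
    "merge_commit_hash": "baa37f65fdff5b780a50d5b5c6bf8bc3ade43815",
    "parents": [
        "d758810c59a9134f437d60f73a82036749688ccb",
        "5dcd493c67ff863c69c1214f0892a80e4951087e",
    ],
    "files_in_merge_conflict": [
        "cogs/gpt_3_commands_and_converser.py",
        "models/openai_model.py",
    ],
})
chain_scenario = json.dumps({
    "file": "composer/models/huggingface.py",
    "times_seen_consecutively": 3,
    "newest_commit": "c24b29f19c4c131a3ea7098dd8b8a5edde344819",
    "oldest_commit": "c1ff80900f46d4e36feb4b326689fe14fc41cbc6",
})

for sample_type, raw in [("merge", merge_scenario),
                         ("file_commit_chain", chain_scenario)]:
    for command in inspection_commands(sample_type, parse_scenario(raw)):
        print(command)
```

The commands are only printed, not executed, so the sketch can be run without a clone of the referenced repositories.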
# Dataset Statistics

We provide some statistics on the diversity of our dataset with respect to repositories and merge conflict resolution samples.

## Dataset Skew

Our dataset is skewed towards the top three repositories in particular; however, the skew flattens quickly.

### Distribution Statistics

- Total number of repositories (count): 100
- Average (mean) samples per repository: 1.2
- Standard deviation (std): 0.79
- Minimum (min): 1
- 25th percentile (25%): 1
- Median (50%): 1
- 75th percentile (75%): 1
- Maximum (max): 8

### Top-10 Repositories by Sample Count

| Repository | Percentage of Total Samples |
|------------------------------------------|----------------------------:|
| oss-review-toolkit/ort | 6.67% |
| stripe/stripe-android | 2.50% |
| element-hq/element-android | 2.50% |
| apache/hive | 1.67% |
| coil-kt/coil | 1.67% |
| wikimedia/apps-android-wikipedia | 1.67% |
| facebookresearch/habitat-lab | 1.67% |
| liquibase/liquibase | 1.67% |
| google/guava | 1.67% |
| kotlin/kotlinx.coroutines | 1.67% |

## Difficulty Distribution for "merge" Scenarios

| Difficulty | Percentage |
|------------|-------------|
| easy | 51.67% |
| medium | 21.67% |
| hard | 26.67% |

**Languages**

The text data in this dataset consists mostly of commit messages and comments and is primarily in English. We do not, however, explicitly filter for any human language.

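The percentages in the statistics tables can be converted back into sample counts as a sanity check. The sketch below assumes that the repository percentages are relative to all 120 samples and that the difficulty percentages are relative to the 60 `merge` samples (3 languages × 20 samples each); the function and variable names are illustrative only.

```python
TOTAL_SAMPLES = 120
MERGE_SAMPLES = 60   # assumption: 3 languages x 20 merge samples each
REPOSITORIES = 100


def to_count(percentage: float, total: int) -> int:
    """Convert a rounded percentage back into a whole number of samples."""
    return round(percentage / 100 * total)


# Three of the distinct percentages from the top-10 repository table.
repo_share = {
    "oss-review-toolkit/ort": 6.67,
    "stripe/stripe-android": 2.50,
    "apache/hive": 1.67,
}
difficulty_share = {"easy": 51.67, "medium": 21.67, "hard": 26.67}

repo_counts = {name: to_count(p, TOTAL_SAMPLES) for name, p in repo_share.items()}
difficulty_counts = {d: to_count(p, MERGE_SAMPLES) for d, p in difficulty_share.items()}

print(repo_counts)
# {'oss-review-toolkit/ort': 8, 'stripe/stripe-android': 3, 'apache/hive': 2}
print(difficulty_counts)
# {'easy': 31, 'medium': 13, 'hard': 16}
print(TOTAL_SAMPLES / REPOSITORIES)
# 1.2
```

Under these assumptions, the largest repository contributes 8 samples, which matches the reported maximum, and the difficulty counts sum to 60.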
# Cite Us

```bibtex
@inproceedings{lindenbauer-etal-2025-gitgoodbench,
    title = "{G}it{G}ood{B}ench: A Novel Benchmark For Evaluating Agentic Performance On Git",
    author = "Lindenbauer, Tobias and Bogomolov, Egor and Zharov, Yaroslav",
    editor = "Kamalloo, Ehsan and Gontier, Nicolas and Lu, Xing Han and Dziri, Nouha and Murty, Shikhar and Lacoste, Alexandre",
    booktitle = "Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.realm-1.19/",
    doi = "10.18653/v1/2025.realm-1.19",
    pages = "272--288",
    ISBN = "979-8-89176-264-0",
    abstract = "Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11{\%} solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming."
}
```

