Download link:

https://modelscope.cn/datasets/JetBrains/git_good_bench


Resource description:

# Dataset Summary

GitGoodBench Lite is a subset of 900 samples for evaluating the performance of AI agents in resolving git tasks (see Supported Tasks). The samples are evenly split across the programming languages Python, Java and Kotlin and across the sample types merge conflict resolution and file-commit chain. The dataset thus contains 150 samples per combination of sample type and programming language. All data are collected from 479 unique, open-source GitHub repositories with permissive licenses that have >= 1000 stars, >= 5 branches, >= 10 contributors and are neither forks nor archived. We collected the initial list of repositories using [SEART](https://seart-ghs.si.usi.ch/).

Evaluation is to be performed by exact-match (EM) of diffs for the merge conflict setting and by LLM-as-a-Judge for the file-commit chain setting. [For further details see our paper.](https://aclanthology.org/2025.realm-1.19/)

# Supported Tasks

GitGoodBench Lite contains two types of samples: `merge` and `file_commit_chain`. Note that the sample type `file_commit_chain` can be used for two scenario types: performing an interactive rebase to clean up the local tree, or iteratively generating commits based on the staged, uncommitted changes.

## Merge

Merge scenarios contain one or more merge conflicts that occurred during a merge. All merge conflicts are guaranteed to be in a Python, Java or Kotlin file. The dataset contains only merges with exactly two parents (no octopus merges).
A merge scenario looks as follows:

```
{
    'merge_commit_hash': '1ce7091bffb09ad7e5123ea995c1f572a83bd375',
    'parents': ['5ef9152860a8b0af02e9d5d3635601df963748c9', '8a353cf3392e0c20dc987bc18f4ab93edccf09b3'],
    'number_of_files_with_merge_conflict': 1,
    'total_number_of_merge_conflicts': 3,
    'files_in_merge_conflict': ['src/test/java/net/openhft/chronicle/queue/LastAppendedTest.java']
}
```

Here `merge_commit_hash` contains the ground truth merge commit, and `parents` are the two commits whose merge produced the conflict(s) in `files_in_merge_conflict`.

## File-Commit Chain

File-commit chain scenarios are delimited by two commits, the oldest and the newest commit. `file` was modified in every commit between `oldest_commit` and `newest_commit` (inclusive). In total the chain consists of `times_seen_consecutively` commits. The intended use cases of these scenarios are to evaluate the agent's capacity to create meaningful, cohesive commits or to improve the local tree via rebasing. Thus samples of this `sample_type` cover two scenario types.

File-commit chains are at least 3 commits long, the file the sample concerns itself with is guaranteed to be of `programming_language` (this is not the case for other files touched by the commits of the sample), and no commit is a merge commit.

A `file_commit_chain` scenario looks as follows:

```
{
    'file': 'App/Event.py',
    'branch': 'origin/20230105',
    'times_seen_consecutively': 4,
    'purity': 0.78,
    'newest_commit': '7547d1877f0af28a67fe0e1ccaefcb0020a89751',
    'oldest_commit': 'a0a11bd4de009daae463c77fccdb2de16cfed6c4'
}
```

`purity` indicates the relative amount of changes in the chain that occurred solely in `file` and is a heuristic for the difficulty of the scenario. We expect noisier scenarios (those with lower `purity`) to be more difficult.

# Dataset Structure

The following table provides per-field details.
Columns marked “Yes” under **Is Metadata?** are those that provide contextual or descriptive information but are not essential to the primary scenario logic.

| **Field** | **Type** | **Description** | **Is Metadata?** |
|-----------|----------|-----------------|------------------|
| **id** | string | A unique identifier for the dataset entry: `<name>-<sample_type>-<running_index>` | No |
| **name** | string | The repository name, in “owner/repository” format. | No |
| **default_branch** | string | The primary or default branch for the repository. | No |
| **license** | string | Repository license. | Yes |
| **stargazers** | integer | The number of stars on GitHub. | Yes |
| **created_at** | string | The repository creation date. | Yes |
| **topics** | string | A semicolon-delimited list of topics or tags associated with the repository. | Yes |
| **programming_language** | string | The programming language of the sample. Possible values: `java`, `python`, or `kotlin`. | No |
| **scenario** | string | A JSON string describing specific scenario data (e.g., merge-conflict details, parent commits). | No |
| **sample_type** | string | The type of sample. Possible values: `merge` or `file_commit_chain`. | No |
| **project_size** | string | Estimated size based on lines of code. Possible values: `tiny`, `small`, `medium`, `large`, or `huge`. | Yes |
| **difficulty** | string | The complexity level of the scenario. Possible values: `easy`, `medium`, or `hard`. | Yes |

**Note**:

- Fields marked as **Is Metadata? = Yes** provide contextual information (e.g., project stats, licensing) rather than forming the core logic of a scenario.
- Fields marked **No** represent the primary data for the scenario. Use them to inform or categorize the scenario type and project details.
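Because `scenario` is stored as a JSON string, each row has to be decoded before the scenario fields described above can be used. The following is a minimal sketch of one way to do that, not part of the official tooling: the dataclass names, the `parse_scenario` helper and the placeholder `id` values are assumptions made for illustration, and the rows are built by hand from the two examples above rather than loaded from the dataset.

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass
class MergeScenario:
    merge_commit_hash: str
    parents: List[str]
    number_of_files_with_merge_conflict: int
    total_number_of_merge_conflicts: int
    files_in_merge_conflict: List[str]


@dataclass
class FileCommitChainScenario:
    file: str
    branch: str
    times_seen_consecutively: int
    purity: float
    newest_commit: str
    oldest_commit: str


def parse_scenario(row: dict) -> Union[MergeScenario, FileCommitChainScenario]:
    """Decode row['scenario'] (a JSON string) according to row['sample_type']."""
    payload = json.loads(row["scenario"])
    if row["sample_type"] == "merge":
        return MergeScenario(**payload)
    if row["sample_type"] == "file_commit_chain":
        return FileCommitChainScenario(**payload)
    raise ValueError(f"unknown sample_type: {row['sample_type']!r}")


# Hypothetical rows: the ids are placeholders, and only the fields that
# parse_scenario reads are filled in. Scenario values come from the examples above.
merge_row = {
    "id": "owner/repository-merge-0",
    "sample_type": "merge",
    "scenario": json.dumps({
        "merge_commit_hash": "1ce7091bffb09ad7e5123ea995c1f572a83bd375",
        "parents": [
            "5ef9152860a8b0af02e9d5d3635601df963748c9",
            "8a353cf3392e0c20dc987bc18f4ab93edccf09b3",
        ],
        "number_of_files_with_merge_conflict": 1,
        "total_number_of_merge_conflicts": 3,
        "files_in_merge_conflict": [
            "src/test/java/net/openhft/chronicle/queue/LastAppendedTest.java"
        ],
    }),
}
chain_row = {
    "id": "owner/repository-file_commit_chain-0",
    "sample_type": "file_commit_chain",
    "scenario": json.dumps({
        "file": "App/Event.py",
        "branch": "origin/20230105",
        "times_seen_consecutively": 4,
        "purity": 0.78,
        "newest_commit": "7547d1877f0af28a67fe0e1ccaefcb0020a89751",
        "oldest_commit": "a0a11bd4de009daae463c77fccdb2de16cfed6c4",
    }),
}

merge_scenario = parse_scenario(merge_row)
chain_scenario = parse_scenario(chain_row)
print(merge_scenario.parents)
print(chain_scenario.file, chain_scenario.purity)
```

Dispatching on `sample_type` keeps the two scenario shapes separate, so a misspelled or missing key fails loudly at construction time instead of surfacing later in an evaluation run.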
# Dataset Statistics

We provide some statistics on the diversity of our dataset with respect to repositories and merge conflict resolution samples.

## Dataset Skew

The dataset is skewed, especially towards the top four repositories, but the skew flattens quickly.

### Distribution Statistics

- Total number of repositories (count): 479
- Average (mean) samples per repository: 1.87
- Standard deviation (std): 2.8
- Minimum (min): 1
- 25th percentile (25%): 1
- Median (50%): 1
- 75th percentile (75%): 2
- Maximum (max): 46

### Top-10 Repositories by Sample Count

| Repository | Percentage of Total Samples |
|------------------------------------------|----------------------------:|
| oss-review-toolkit/ort | 5.11% |
| stripe/stripe-android | 3.22% |
| element-hq/element-android | 2.44% |
| jetbrains/compose-multiplatform | 1.22% |
| kotlin/dokka | 1.00% |
| jetbrains/ideavim | 0.89% |
| wikimedia/apps-android-wikipedia | 0.78% |
| android/nowinandroid | 0.78% |
| coil-kt/coil | 0.78% |
| jetbrains/exposed | 0.78% |

## Difficulty Distribution for "merge" Scenarios

| Difficulty | Percentage (%) |
|------------|---------------:|
| easy | 41.33 |
| medium | 24.44 |
| hard | 34.22 |

**Languages**

The text data in this dataset consists mostly of commit messages and comments and is primarily in English. However, we do not explicitly filter for any human language.
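The per-repository figures under Distribution Statistics can in principle be recomputed from the `name` column. The sketch below shows the idea on a small hand-made list; the `repository_skew` helper and the toy repository names are hypothetical and not taken from the dataset. The last line checks one relation that follows directly from the reported numbers: a repository with 46 of 900 samples accounts for 5.11% of the total, which matches the top entry of the table above.

```python
from collections import Counter
from statistics import mean, median


def repository_skew(names, top_n=3):
    """Summarize how samples are distributed over repositories."""
    counts = Counter(names)              # samples per repository
    per_repo = sorted(counts.values())
    total = len(names)
    return {
        "repositories": len(counts),
        "mean": mean(per_repo),
        "min": per_repo[0],
        "median": median(per_repo),
        "max": per_repo[-1],
        # share of all samples held by the most frequent repositories, in percent
        "top": [(repo, round(100 * n / total, 2))
                for repo, n in counts.most_common(top_n)],
    }


# Toy input standing in for the `name` column (hypothetical repository names)
toy_names = ["a/x"] * 4 + ["b/y"] * 2 + ["c/z", "d/w"]
stats = repository_skew(toy_names)
print(stats)

# Consistency check on the reported figures: max 46 samples out of 900 in total
top_share = round(100 * 46 / 900, 2)
print(top_share)  # 5.11
```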
# Cite Us

```bibtex
@inproceedings{lindenbauer-etal-2025-gitgoodbench,
    title = "{G}it{G}ood{B}ench: A Novel Benchmark For Evaluating Agentic Performance On Git",
    author = "Lindenbauer, Tobias and Bogomolov, Egor and Zharov, Yaroslav",
    editor = "Kamalloo, Ehsan and Gontier, Nicolas and Lu, Xing Han and Dziri, Nouha and Murty, Shikhar and Lacoste, Alexandre",
    booktitle = "Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.realm-1.19/",
    doi = "10.18653/v1/2025.realm-1.19",
    pages = "272--288",
    ISBN = "979-8-89176-264-0",
    abstract = "Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on Version Control System (VCS) tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11{\%} solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming."
}
```

