
Maxscha/commitbench

Hugging Face, updated 2024-02-14, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Maxscha/commitbench
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- code
size_categories:
- 1M<n<10M
---

# CommitBench: A Benchmark for Commit Message Generation

## EXECUTIVE SUMMARY

We provide CommitBench as an open-source, reproducible, and privacy- and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We provide six programming languages: Java, Python, Go, JavaScript, PHP, and Ruby. The natural-language commit messages are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were produced with extensive quality-focused filtering techniques (e.g., excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models that accept longer input sequences.

## CURATION RATIONALE

We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to cover several programming languages, since the task may vary with how hardware-related a language is. The dataset is provided as a large CSV file containing all samples. We provide the following fields: Diff, Commit Message, Hash, Project, Split.

## DOCUMENTATION FOR SOURCE DATASETS

Repository selection is based on CodeSearchNet, which can be found at [https://github.com/github/CodeSearchNet](https://github.com/github/CodeSearchNet).

## LANGUAGE VARIETIES

Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account.

### Overview of split by programming language for CommitBench:

- Java: 153,119
- Ruby: 233,710
- Go: 137,998
- JavaScript: 373,598
- Python: 472,469
- PHP: 294,394

## SPEAKER DEMOGRAPHIC

Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

## ANNOTATOR DEMOGRAPHIC

Due to the automated generation of the dataset, no annotators were used.

## SPEECH SITUATION AND CHARACTERISTICS

The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused, and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, some commit messages also express the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, some instances may still be in the dataset.

## PREPROCESSING AND DATA FORMATTING

See our paper for all preprocessing steps. We do not provide the unprocessed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

## CAPTURE QUALITY

While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down or someone with a software project in the dataset deletes their repository, some instances may become non-reproducible.

## LIMITATIONS

While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples. Some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more) as well as our focus on English commit messages.

## METADATA

- **License:** Dataset under the CC BY-NC 4.0 license, code under the MIT license

## DISCLOSURES AND ETHICAL REVIEW

While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

## ABOUT THIS DOCUMENT

A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman, can be found at [https://techpolicylab.uw.edu/data-statements/](https://techpolicylab.uw.edu/data-statements/), and was updated from the community Version 1 Markdown template by Leon Derczynski.
Provided by:
Maxscha
Summary of original information

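CommitBench: A Benchmark Dataset for Commit Message Generation

Overview

CommitBench is an open-source, reproducible, privacy- and license-aware benchmark dataset for commit message generation. It is collected from GitHub repositories whose licenses permit redistribution and covers six programming languages: Java, Python, Go, JavaScript, PHP, and Ruby. Commit messages are restricted to English, the working language of many software development projects. The dataset contains 1,664,590 examples, produced with extensive quality-focused filtering (for example, excluding bot commits). A version with longer sequences is also provided for benchmarking models that accept longer input sequences.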

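Curation Rationale

We created this dataset because of quality and legal issues with previous commit message generation datasets. Given a git diff showing the code changes between two versions of a file, the task is to predict the natural-language commit message that describes those changes. Our repository selection is based on that of CodeSearchNet, but we apply a large number of filtering techniques to improve data quality and eliminate noise. Because of the original repository selection, we are limited to the programming languages listed above.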
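The task framing above, together with the five fields the card lists for the CSV (Diff, Commit Message, Hash, Project, Split), can be sketched as follows. This is a minimal illustration, not the authors' loading code: the lowercase header names (`diff`, `message`, `hash`, `project`, `split`), the split labels, and the two sample rows are assumptions made up for the example, since the card does not give the exact column spelling.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for the real CSV. The header names and rows are
# hypothetical; the card only names the fields, not their exact spelling.
SAMPLE_CSV = '''diff,message,hash,project,split
"--- a/app.py
+++ b/app.py
@@ -1 +1 @@
-print('hi')
+print('hello')",Update greeting,abc123,example/app,train
"--- a/util.go
+++ b/util.go
@@ -3 +3 @@
-return a
+return a + b",Fix sum helper,def456,example/util,test
'''


def load_rows(handle):
    """Return a list of dicts, one per commit, from an open CSV handle."""
    return list(csv.DictReader(handle))


def to_pairs(rows, split):
    """Select one split and return (diff, message) pairs: input and target."""
    return [(r["diff"], r["message"]) for r in rows if r["split"] == split]


rows = load_rows(io.StringIO(SAMPLE_CSV))
split_sizes = Counter(r["split"] for r in rows)  # number of rows per split
train_pairs = to_pairs(rows, "train")
```

For the real file, the same functions would take a handle from `open(path, newline="", encoding="utf-8")`. Very long diffs may exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.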

Documentation for Source Datasets

Repository selection is based on CodeSearchNet.

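Language Varieties

Because GitHub hosts software projects from all over the world, no single uniform variety of English is used across the commit messages. Phrasing may be regional or influenced by the programmer's native language. Different spelling conventions may co-exist, and different terms may be used for the same concept. Any model trained on this data should take these factors into account.

Overview of CommitBench by programming language: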

  • Java: 153,119
  • Ruby: 233,710
  • Go: 137,998
  • JavaScript: 373,598
  • Python: 472,469
  • PHP: 294,394
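
As a quick sanity check on the list above, the counts can be totalled and turned into shares. The numbers are copied from the list; nothing else is assumed. The listed counts sum to 1,665,288, slightly more than the 1,664,590 examples stated in the overview, and the source does not explain the difference.

```python
# Per-language counts copied from the overview list above.
counts = {
    "Java": 153_119,
    "Ruby": 233_710,
    "Go": 137_998,
    "JavaScript": 373_598,
    "Python": 472_469,
    "PHP": 294_394,
}

total = sum(counts.values())
shares = {lang: n / total for lang, n in counts.items()}

# Print languages from largest to smallest with their share of the total.
for lang, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<12}{n:>9,}{shares[lang]:>8.1%}")
print(f"{'Total':<12}{total:>9,}")
```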

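Speaker Demographic

Because the software development community is extremely diverse, both geographically and socio-economically, the data does not come from any single demographic. Globally, the average software developer tends to be male and to have a higher-education background. Because GitHub profiles are anonymous, gender distribution information cannot be extracted.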

Annotator Demographic

Because the dataset was generated automatically, no annotators were used.

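Speech Situation and Characteristics

The data created by the original GitHub users is public and often business-related, which fosters more neutral, information-focused, and formal language. Because developers often find writing commit messages tedious, some commit messages may also reflect the author's frustration or boredom. Our filtering is intended to catch such messages, but some instances may remain in the dataset.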

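Preprocessing and Data Formatting

See our paper for the preprocessing steps. Because of privacy concerns, we do not provide the unprocessed raw data, but it can be obtained via CodeSearchNet or requested from the authors.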

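Capture Quality

Although our dataset is fully reproducible at the time of writing, external dependencies could limit this. If GitHub shuts down or a software project in the dataset deletes its repository, some instances may become non-reproducible.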

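Limitations

Although our filters are meant to ensure high quality for every data sample, we cannot ensure that only low-quality examples were removed. Likewise, we cannot guarantee that our extensive filtering catches all low-quality examples; some may remain in the dataset. Another limitation is the small number of programming languages (there are many more) and our focus on English commit messages.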

Metadata

  • License: Dataset under the CC BY-NC 4.0 license, code under the MIT license