JetBrains-Research/commit-chronicle

Name: JetBrains-Research/commit-chronicle
Creator: JetBrains-Research
Published: 2023-10-05 10:50:00
License: No description available

Hugging Face | Updated: 2023-10-05 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/JetBrains-Research/commit-chronicle

Overview:

CommitChronicle is a dataset for commit message generation and/or completion. It contains 10.7M commits from 11.9k GitHub repositories and covers 20 programming languages. The dataset is large-scale, multilingual, and diverse: it avoids restrictive filtering of commit messages and of commit diff structure. It provides metadata about commit authors and dates and is split by project, which makes it suitable for experiments that use commit history. The dataset was built with the GitHub Search tool and the official GitHub API, and the data was cleaned and filtered to avoid including personal information and duplicates. Each record includes fields such as the repository, commit hash, author, date, time zone, license, language, and commit message. The data is published as a default configuration and two subset configurations. Users should be aware that the dataset may contain sensitive information and must comply with the terms of the original licenses.

Provided by:

JetBrains-Research

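Summary of the original dataset card

Dataset overview

Dataset name

CommitChronicle

Dataset description

A dataset for commit message generation (and/or completion), introduced in the paper "From Commit Message Generation to History-Aware Commit Message Completion", ASE 2023.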

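Dataset features

Large-scale and multilingual: 10.7M commits from 11.9k GitHub repositories, covering 20 programming languages.
Diverse: avoids restrictive filtering of commit messages and of commit diff structure.
Suitable for experiments with commit history: provides metadata about commit authors and dates, and is split by project.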
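
The history-oriented use case can be illustrated with a small sketch. It assumes each commit is a dictionary with the `repo`, `author`, `date`, and `message` fields of this dataset, and that `date` has already been parsed into a `datetime`. The helper name `build_histories` and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import datetime


def build_histories(commits):
    """Group commits by (repo, author) and sort each group chronologically.

    Hypothetical helper: assumes every commit dict has `repo`, `author`,
    `date` (a datetime), and `message` keys.
    """
    histories = defaultdict(list)
    for commit in commits:
        histories[(commit["repo"], commit["author"])].append(commit)
    for group in histories.values():
        group.sort(key=lambda c: c["date"])
    return dict(histories)


# Made-up sample records, not taken from the dataset.
sample = [
    {"repo": "org/app", "author": 7, "date": datetime(2021, 5, 2), "message": "Fix typo"},
    {"repo": "org/app", "author": 7, "date": datetime(2021, 5, 1), "message": "Add parser"},
    {"repo": "org/lib", "author": 3, "date": datetime(2020, 1, 9), "message": "Initial commit"},
]

histories = build_histories(sample)
# Messages by author 7 in org/app, oldest first.
previous_messages = [c["message"] for c in histories[("org/app", 7)]]
```

Keying on both repository and author keeps each history inside a single project, which matches the project-wise split; whether histories should also span repositories is a design choice this sketch does not settle.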

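Dataset structure

Data instances

Each data instance is a single commit.

Data fields

repo: repository the commit belongs to.
hash: commit hash.
author: unique ID of the commit author.
date: commit date.
timezone: commit time zone.
license: repository license.
language: primary language of the repository.
message: processed commit message.
original_message: unprocessed commit message.
mods: list of file modifications in the commit.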

File modification fields

change_type: type of change made to the file.
old_path: file path before the change.
new_path: file path after the change.
diff: git diff for this file.
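
As a rough illustration of how the commit fields and the per-file modification fields fit together, the sketch below models one record with `TypedDict`. The field names come from the lists above; the Python types, the `render_diff` helper, the header format it emits, and the sample values (including the `"MODIFY"` change type) are assumptions made for this example.

```python
from typing import List, Optional, TypedDict


class FileModification(TypedDict):
    change_type: str
    old_path: Optional[str]  # assumed to be absent for newly added files
    new_path: Optional[str]  # assumed to be absent for deleted files
    diff: str


class Commit(TypedDict):
    repo: str
    hash: str
    author: int  # assumed type for the unique author ID
    date: str
    timezone: int  # assumed type
    license: str
    language: str
    message: str
    original_message: str
    mods: List[FileModification]


def render_diff(commit: Commit) -> str:
    """Concatenate per-file diffs, each preceded by a one-line header."""
    parts = []
    for mod in commit["mods"]:
        header = f"{mod['change_type']}: {mod['old_path']} -> {mod['new_path']}"
        parts.append(header + "\n" + mod["diff"])
    return "\n".join(parts)


# Made-up record for illustration only.
example: Commit = {
    "repo": "org/app", "hash": "abc123", "author": 7,
    "date": "2021-05-01", "timezone": 0, "license": "MIT",
    "language": "Python", "message": "Bump x", "original_message": "Bump x",
    "mods": [{"change_type": "MODIFY", "old_path": "src/a.py",
              "new_path": "src/a.py", "diff": "@@ -1 +1 @@\n-x = 1\n+x = 2"}],
}

text = render_diff(example)
```

A flattened string like `text` is one plausible model input for commit message generation; the actual preprocessing used by the dataset authors is not described here.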

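Data splits

default: the full dataset, with train, validation, and test splits.
subset_cmg: test subset used for experiments with commit message generation (CMG) methods.
subset_llm: test subset used for experiments with an LLM.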

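Dataset size

default: 7.66M commits in the train split, 1.55M in the validation split, and 1.49M in the test split.
subset_cmg: 204k commits in the test split.
subset_llm: 4k commits in the test split.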
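
The reported sizes can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded figures listed above, so the results are approximate; treating the two subsets as fractions of the default test split is an assumption based on their description as test subsets.

```python
# Rounded split sizes (in commits) as reported for the default configuration.
default_sizes = {"train": 7_660_000, "validation": 1_550_000, "test": 1_490_000}

# The three splits should add up to the headline figure of 10.7M commits.
total = sum(default_sizes.values())

# Percentage of all commits that falls into each split.
shares = {name: round(100 * n / total, 1) for name, n in default_sizes.items()}

# Approximate size of each subset relative to the default test split.
subset_sizes = {"subset_cmg": 204_000, "subset_llm": 4_000}
subset_shares = {name: 100 * n / default_sizes["test"] for name, n in subset_sizes.items()}
```

With these figures the split is roughly 72% train, 14% validation, and 14% test, and `subset_cmg` covers a little under 14% of the test split.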

License

other

Languages

code
en

Task categories

text-generation
summarization

Dataset configurations

default
- Data file paths: data/test-*, data/train-*, data/validation-*
subset_cmg
- Data file paths: subset_cmg/test-*
subset_llm
- Data file paths: subset_llm/test-*
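
The configuration names and file patterns above can be put to work as follows. This is a minimal sketch: `configs_for` and `load_config` are hypothetical helper names, the file names in the usage lines are made up, and `load_config` assumes the Hugging Face `datasets` package is installed and that the dataset is reachable under the repository ID shown at the top of this page.

```python
from fnmatch import fnmatch

# Configuration names and file patterns as listed above.
CONFIG_PATTERNS = {
    "default": ["data/test-*", "data/train-*", "data/validation-*"],
    "subset_cmg": ["subset_cmg/test-*"],
    "subset_llm": ["subset_llm/test-*"],
}


def configs_for(path):
    """Return the configurations whose patterns match a repository file path."""
    return [name for name, patterns in CONFIG_PATTERNS.items()
            if any(fnmatch(path, p) for p in patterns)]


def load_config(name, split="test"):
    """Load one configuration; needs the `datasets` package and network access."""
    # Imported lazily so the pattern helpers above work without `datasets`.
    from datasets import load_dataset
    return load_dataset("JetBrains-Research/commit-chronicle", name, split=split)


# Made-up file names, used only to exercise the patterns.
matched_default = configs_for("data/train-00000-of-00010.parquet")
matched_llm = configs_for("subset_llm/test-00000-of-00001.parquet")
matched_none = configs_for("README.md")
```

Calling `load_config("subset_llm")` would fetch the smallest configuration, which is a reasonable starting point before downloading the full default configuration.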

The content above was collected and summarized by 遇见数据集.
