ObscuraCoder/commit-chronicle
Source: Hugging Face. Updated 2024-09-25; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/ObscuraCoder/commit-chronicle
Resource description:
---
dataset_info:
- config_name: C
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 116010891
    num_examples: 138903
  - name: validation
    num_bytes: 23655147
    num_examples: 28739
  - name: test
    num_bytes: 15089573
    num_examples: 18108
  download_size: 84218245
  dataset_size: 154755611
- config_name: C++
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 283988203
    num_examples: 318858
  - name: validation
    num_bytes: 71347480
    num_examples: 81875
  - name: test
    num_bytes: 40434527
    num_examples: 45965
  download_size: 210054789
  dataset_size: 395770210
- config_name: Go
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 233348778
    num_examples: 266236
  - name: validation
    num_bytes: 50916905
    num_examples: 55959
  - name: test
    num_bytes: 49018314
    num_examples: 55278
  download_size: 177300040
  dataset_size: 333283997
- config_name: Java
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 341095989
    num_examples: 352118
  - name: validation
    num_bytes: 70142853
    num_examples: 75173
  - name: test
    num_bytes: 73746462
    num_examples: 75789
  download_size: 237710158
  dataset_size: 484985304
- config_name: Python
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 540249752
    num_examples: 627007
  - name: validation
    num_bytes: 83521636
    num_examples: 98147
  - name: test
    num_bytes: 92836805
    num_examples: 113517
  download_size: 390614385
  dataset_size: 716608193
- config_name: Rust
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 78635879
    num_examples: 88593
  - name: validation
    num_bytes: 20637229
    num_examples: 23356
  - name: test
    num_bytes: 15019809
    num_examples: 16851
  download_size: 61646506
  dataset_size: 114292917
- config_name: TypeScript
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 306561530
    num_examples: 378871
  - name: validation
    num_bytes: 56861379
    num_examples: 72008
  - name: test
    num_bytes: 62396070
    num_examples: 77097
  download_size: 224008717
  dataset_size: 425818979
configs:
- config_name: C
  data_files:
  - split: train
    path: C/train-*
  - split: validation
    path: C/validation-*
  - split: test
    path: C/test-*
- config_name: C++
  data_files:
  - split: train
    path: C++/train-*
  - split: validation
    path: C++/validation-*
  - split: test
    path: C++/test-*
- config_name: Go
  data_files:
  - split: train
    path: Go/train-*
  - split: validation
    path: Go/validation-*
  - split: test
    path: Go/test-*
- config_name: Java
  data_files:
  - split: train
    path: Java/train-*
  - split: validation
    path: Java/validation-*
  - split: test
    path: Java/test-*
- config_name: Python
  data_files:
  - split: train
    path: Python/train-*
  - split: validation
    path: Python/validation-*
  - split: test
    path: Python/test-*
- config_name: Rust
  data_files:
  - split: train
    path: Rust/train-*
  - split: validation
    path: Rust/validation-*
  - split: test
    path: Rust/test-*
- config_name: TypeScript
  data_files:
  - split: train
    path: TypeScript/train-*
  - split: validation
    path: TypeScript/validation-*
  - split: test
    path: TypeScript/test-*
tags:
- code
size_categories:
- 100K<n<1M
---
This is a filtered version of the [`JetBrains-Research/commit-chronicle`](https://huggingface.co/datasets/JetBrains-Research/commit-chronicle) dataset, restricted to commits in the following languages:
```python
[
"C", "C++", "Go", "Java", "Python", "Rust", "TypeScript"
]
```
The following additional filtering steps were applied:
1. All features other than `message` and `diff` were removed.
2. Only commits that modify a single file were kept.
3. Samples whose diff is longer than 1024 tokens, as estimated with the [`ObscuraCoder/Tokenizer`](https://huggingface.co/ObscuraCoder/Tokenizer) tokenizer, were discarded.
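
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the input record layout (a `language` field and a `mods` list with one entry per modified file, each carrying a `diff`) is an assumption about the upstream dataset, `filter_commits` is a hypothetical helper, and a whitespace split stands in for the `ObscuraCoder/Tokenizer` token count.

```python
from typing import Callable, Iterable, Iterator

LANGUAGES = {"C", "C++", "Go", "Java", "Python", "Rust", "TypeScript"}
MAX_DIFF_TOKENS = 1024  # threshold stated in the filtering steps


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption); the card names ObscuraCoder/Tokenizer."""
    return len(text.split())


def filter_commits(
    commits: Iterable[dict],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_DIFF_TOKENS,
) -> Iterator[dict]:
    """Yield message/diff records for commits that pass every filter."""
    for commit in commits:
        # Language subset (field name "language" is assumed).
        if commit.get("language") not in LANGUAGES:
            continue
        # Step 2: keep only commits that modify a single file.
        mods = commit.get("mods", [])
        if len(mods) != 1:
            continue
        # Step 3: discard samples whose diff exceeds the token budget.
        diff = mods[0]["diff"]
        if count_tokens(diff) > max_tokens:
            continue
        # Step 1: retain only the message and the diff.
        yield {"message": commit["message"], "diff": diff}


# Hypothetical sample records, for illustration only.
sample = [
    {"message": "Fix typo", "language": "Python", "mods": [{"diff": "-teh\n+the"}]},
    {"message": "Refactor", "language": "Go", "mods": [{"diff": "a"}, {"diff": "b"}]},
    {"message": "Huge change", "language": "Rust", "mods": [{"diff": "x " * 2000}]},
    {"message": "Update script", "language": "Shell", "mods": [{"diff": "+echo hi"}]},
]
kept = list(filter_commits(sample))
print(kept)
```

Passing a different `count_tokens` callable (for example, one wrapping a real tokenizer) changes how step 3 measures length without touching the rest of the logic.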
Dataset information:
Details for each config:
1. Config name: C
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 116010891 bytes, 138903 examples
   - validation: 23655147 bytes, 28739 examples
   - test: 15089573 bytes, 18108 examples
   Download size: 84218245 bytes; dataset size: 154755611 bytes
2. Config name: C++
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 283988203 bytes, 318858 examples
   - validation: 71347480 bytes, 81875 examples
   - test: 40434527 bytes, 45965 examples
   Download size: 210054789 bytes; dataset size: 395770210 bytes
3. Config name: Go
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 233348778 bytes, 266236 examples
   - validation: 50916905 bytes, 55959 examples
   - test: 49018314 bytes, 55278 examples
   Download size: 177300040 bytes; dataset size: 333283997 bytes
4. Config name: Java
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 341095989 bytes, 352118 examples
   - validation: 70142853 bytes, 75173 examples
   - test: 73746462 bytes, 75789 examples
   Download size: 237710158 bytes; dataset size: 484985304 bytes
5. Config name: Python
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 540249752 bytes, 627007 examples
   - validation: 83521636 bytes, 98147 examples
   - test: 92836805 bytes, 113517 examples
   Download size: 390614385 bytes; dataset size: 716608193 bytes
6. Config name: Rust
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 78635879 bytes, 88593 examples
   - validation: 20637229 bytes, 23356 examples
   - test: 15019809 bytes, 16851 examples
   Download size: 61646506 bytes; dataset size: 114292917 bytes
7. Config name: TypeScript
   Features:
   - Field `message`, dtype: string
   - Field `diff`, dtype: string
   Splits:
   - train: 306561530 bytes, 378871 examples
   - validation: 56861379 bytes, 72008 examples
   - test: 62396070 bytes, 77097 examples
   Download size: 224008717 bytes; dataset size: 425818979 bytes
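
As a quick sanity check on the figures above, the per-config example counts can be tallied in a few lines. This is only an illustrative sketch: the numbers are copied from the card metadata, while the names `SPLIT_EXAMPLES`, `totals`, and `train_share` are made up for this example. By this tally each config on its own falls inside the `100K<n<1M` size category, whereas the seven configs together hold roughly 3.0 million examples.

```python
# Example counts per config and split, copied from the dataset card metadata.
SPLIT_EXAMPLES = {
    "C": {"train": 138903, "validation": 28739, "test": 18108},
    "C++": {"train": 318858, "validation": 81875, "test": 45965},
    "Go": {"train": 266236, "validation": 55959, "test": 55278},
    "Java": {"train": 352118, "validation": 75173, "test": 75789},
    "Python": {"train": 627007, "validation": 98147, "test": 113517},
    "Rust": {"train": 88593, "validation": 23356, "test": 16851},
    "TypeScript": {"train": 378871, "validation": 72008, "test": 77097},
}

# Total examples per config, and the fraction that lands in the train split.
totals = {name: sum(splits.values()) for name, splits in SPLIT_EXAMPLES.items()}
train_share = {name: SPLIT_EXAMPLES[name]["train"] / totals[name] for name in totals}
grand_total = sum(totals.values())

for name in SPLIT_EXAMPLES:
    print(f"{name:<10} {totals[name]:>7} examples, train share {train_share[name]:.1%}")
print(f"all configs: {grand_total} examples")
```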
Config details:
Data file paths for each config:
- C: train `C/train-*`, validation `C/validation-*`, test `C/test-*`
- C++: train `C++/train-*`, validation `C++/validation-*`, test `C++/test-*`
- Go: train `Go/train-*`, validation `Go/validation-*`, test `Go/test-*`
- Java: train `Java/train-*`, validation `Java/validation-*`, test `Java/test-*`
- Python: train `Python/train-*`, validation `Python/validation-*`, test `Python/test-*`
- Rust: train `Rust/train-*`, validation `Rust/validation-*`, test `Rust/test-*`
- TypeScript: train `TypeScript/train-*`, validation `TypeScript/validation-*`, test `TypeScript/test-*`
Dataset tags: code
Size category: 100K < n < 1M (between 100 thousand and 1 million samples)
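
Given the config and split layout above, one plausible way to load a single split is with the Hugging Face `datasets` library. The sketch below is an assumption about typical usage rather than instructions from the dataset authors: `split_glob` and `load_split` are hypothetical helpers, and `load_split` needs the `datasets` package plus network access, so it is defined here but not called.

```python
REPO_ID = "ObscuraCoder/commit-chronicle"
CONFIGS = ("C", "C++", "Go", "Java", "Python", "Rust", "TypeScript")
SPLITS = ("train", "validation", "test")


def split_glob(config: str, split: str) -> str:
    """Return the data-file pattern the card lists for a config/split pair."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{config}/{split}-*"


def load_split(config: str, split: str = "train"):
    """Load one split of one config from the Hub (requires `datasets` and network)."""
    split_glob(config, split)  # validates the names before any download starts
    from datasets import load_dataset  # imported lazily so the sketch runs without it

    return load_dataset(REPO_ID, config, split=split)


print(split_glob("Rust", "validation"))
```

Calling `load_split("Rust", "validation")` would then be expected to return a dataset with the `message` and `diff` columns described in the card.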
Provider:
ObscuraCoder



