
ObscuraCoder/commit-chronicle

Hugging Face | Updated: 2024-09-25 | Indexed: 2025-04-26
Download link:
https://hf-mirror.com/datasets/ObscuraCoder/commit-chronicle
Description:
---
dataset_info:
- config_name: C
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 116010891
    num_examples: 138903
  - name: validation
    num_bytes: 23655147
    num_examples: 28739
  - name: test
    num_bytes: 15089573
    num_examples: 18108
  download_size: 84218245
  dataset_size: 154755611
- config_name: C++
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 283988203
    num_examples: 318858
  - name: validation
    num_bytes: 71347480
    num_examples: 81875
  - name: test
    num_bytes: 40434527
    num_examples: 45965
  download_size: 210054789
  dataset_size: 395770210
- config_name: Go
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 233348778
    num_examples: 266236
  - name: validation
    num_bytes: 50916905
    num_examples: 55959
  - name: test
    num_bytes: 49018314
    num_examples: 55278
  download_size: 177300040
  dataset_size: 333283997
- config_name: Java
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 341095989
    num_examples: 352118
  - name: validation
    num_bytes: 70142853
    num_examples: 75173
  - name: test
    num_bytes: 73746462
    num_examples: 75789
  download_size: 237710158
  dataset_size: 484985304
- config_name: Python
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 540249752
    num_examples: 627007
  - name: validation
    num_bytes: 83521636
    num_examples: 98147
  - name: test
    num_bytes: 92836805
    num_examples: 113517
  download_size: 390614385
  dataset_size: 716608193
- config_name: Rust
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 78635879
    num_examples: 88593
  - name: validation
    num_bytes: 20637229
    num_examples: 23356
  - name: test
    num_bytes: 15019809
    num_examples: 16851
  download_size: 61646506
  dataset_size: 114292917
- config_name: TypeScript
  features:
  - name: message
    dtype: string
  - name: diff
    dtype: string
  splits:
  - name: train
    num_bytes: 306561530
    num_examples: 378871
  - name: validation
    num_bytes: 56861379
    num_examples: 72008
  - name: test
    num_bytes: 62396070
    num_examples: 77097
  download_size: 224008717
  dataset_size: 425818979
configs:
- config_name: C
  data_files:
  - split: train
    path: C/train-*
  - split: validation
    path: C/validation-*
  - split: test
    path: C/test-*
- config_name: C++
  data_files:
  - split: train
    path: C++/train-*
  - split: validation
    path: C++/validation-*
  - split: test
    path: C++/test-*
- config_name: Go
  data_files:
  - split: train
    path: Go/train-*
  - split: validation
    path: Go/validation-*
  - split: test
    path: Go/test-*
- config_name: Java
  data_files:
  - split: train
    path: Java/train-*
  - split: validation
    path: Java/validation-*
  - split: test
    path: Java/test-*
- config_name: Python
  data_files:
  - split: train
    path: Python/train-*
  - split: validation
    path: Python/validation-*
  - split: test
    path: Python/test-*
- config_name: Rust
  data_files:
  - split: train
    path: Rust/train-*
  - split: validation
    path: Rust/validation-*
  - split: test
    path: Rust/test-*
- config_name: TypeScript
  data_files:
  - split: train
    path: TypeScript/train-*
  - split: validation
    path: TypeScript/validation-*
  - split: test
    path: TypeScript/test-*
tags:
- code
size_categories:
- 100K<n<1M
---

This is a filtered version of the [`JetBrains-Research/commit-chronicle`](https://huggingface.co/datasets/JetBrains-Research/commit-chronicle) dataset. It has been subsetted for the following languages:

```python
[
    "C",
    "C++",
    "Go",
    "Java",
    "Python",
    "Rust",
    "TypeScript"
]
```

Further filtering steps undertaken are:

1. Useless features have been removed and only the message and diff retained.
2. Only commits that modify a single file have been chosen.
3. Samples containing diffs longer than 1024 tokens (by the [`ObscuraCoder/Tokenizer`](https://huggingface.co/ObscuraCoder/Tokenizer) tokenizer estimate) have been discarded.
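The three filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the dataset: the `mods` field name and its per-file `diff` entry are assumed from the upstream dataset's layout, `keep_commit` is a hypothetical helper, and the whitespace token counter is a stand-in for the `ObscuraCoder/Tokenizer` estimate.

```python
from typing import Callable, Dict, Optional

MAX_DIFF_TOKENS = 1024  # threshold stated in the dataset card


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption); the card uses ObscuraCoder/Tokenizer."""
    return len(text.split())


def keep_commit(
    commit: Dict,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_DIFF_TOKENS,
) -> Optional[Dict[str, str]]:
    """Return a {"message", "diff"} record, or None if the commit is filtered out.

    Assumes each commit carries a "message" string and a "mods" list with one
    entry per modified file, each entry holding a "diff" string.
    """
    mods = commit["mods"]
    # Step 2: keep only commits that modify a single file.
    if len(mods) != 1:
        return None
    diff = mods[0]["diff"]
    # Step 3: discard samples whose diff is longer than max_tokens tokens.
    if count_tokens(diff) > max_tokens:
        return None
    # Step 1: drop every other feature and retain only message and diff.
    return {"message": commit["message"], "diff": diff}
```

Passing a real tokenizer-based counter as `count_tokens` would bring the sketch closer to the described procedure; the default is chosen only so the example runs without downloads.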

The dataset information is as follows.

Every configuration has two feature fields, `message` (string) and `diff` (string). Split sizes per configuration (all sizes in bytes):

| Config | Train bytes | Train examples | Validation bytes | Validation examples | Test bytes | Test examples | Download size | Dataset size |
|---|---|---|---|---|---|---|---|---|
| C | 116010891 | 138903 | 23655147 | 28739 | 15089573 | 18108 | 84218245 | 154755611 |
| C++ | 283988203 | 318858 | 71347480 | 81875 | 40434527 | 45965 | 210054789 | 395770210 |
| Go | 233348778 | 266236 | 50916905 | 55959 | 49018314 | 55278 | 177300040 | 333283997 |
| Java | 341095989 | 352118 | 70142853 | 75173 | 73746462 | 75789 | 237710158 | 484985304 |
| Python | 540249752 | 627007 | 83521636 | 98147 | 92836805 | 113517 | 390614385 | 716608193 |
| Rust | 78635879 | 88593 | 20637229 | 23356 | 15019809 | 16851 | 61646506 | 114292917 |
| TypeScript | 306561530 | 378871 | 56861379 | 72008 | 62396070 | 77097 | 224008717 | 425818979 |

Data file paths for each configuration:

- C: train is `C/train-*`, validation is `C/validation-*`, test is `C/test-*`
- C++: train is `C++/train-*`, validation is `C++/validation-*`, test is `C++/test-*`
- Go: train is `Go/train-*`, validation is `Go/validation-*`, test is `Go/test-*`
- Java: train is `Java/train-*`, validation is `Java/validation-*`, test is `Java/test-*`
- Python: train is `Python/train-*`, validation is `Python/validation-*`, test is `Python/test-*`
- Rust: train is `Rust/train-*`, validation is `Rust/validation-*`, test is `Rust/test-*`
- TypeScript: train is `TypeScript/train-*`, validation is `TypeScript/validation-*`, test is `TypeScript/test-*`

Dataset tag: code

Size category: 100K < n < 1M (that is, 100,000 to 1,000,000 samples)

This dataset is a filtered version of the [`JetBrains-Research/commit-chronicle`](https://huggingface.co/datasets/JetBrains-Research/commit-chronicle) dataset, subsetted to the following languages: `["C", "C++", "Go", "Java", "Python", "Rust", "TypeScript"]`

The additional filtering steps are:

1. Redundant features are removed; only the commit message (`message`) and the code diff (`diff`) are retained.
2. Only commits that modify a single file are selected.
3. Samples whose length, as estimated by the [`ObscuraCoder/Tokenizer`](https://huggingface.co/ObscuraCoder/Tokenizer) tokenizer, exceeds 1024 tokens are discarded.
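The path layout above follows one pattern, `<config>/<split>-*`, so it can be generated rather than written out. The sketch below assumes the Hugging Face `datasets` library is installed and the Hub is reachable when `load_config` is called; `data_files_for` and `load_config` are hypothetical helper names introduced here for illustration.

```python
from typing import Dict

REPO_ID = "ObscuraCoder/commit-chronicle"
CONFIGS = ["C", "C++", "Go", "Java", "Python", "Rust", "TypeScript"]
SPLITS = ["train", "validation", "test"]


def data_files_for(config: str) -> Dict[str, str]:
    """Return the glob pattern of each split for one language configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return {split: f"{config}/{split}-*" for split in SPLITS}


def load_config(config: str, split: str = "train"):
    """Load one split of one configuration from the Hub (needs network access)."""
    # Imported lazily so that the path helper above works without `datasets`.
    from datasets import load_dataset

    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_config("Rust", "test")` would be expected to return a dataset with `message` and `diff` columns, per the feature list above.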
Provided by:
ObscuraCoder