five

honesdev/angular-commits

收藏
Hugging Face2026-04-26 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/honesdev/angular-commits
下载链接
链接失效反馈
官方服务:
资源简介:
Angular提交消息数据集是一个从Angular仓库中抓取并清理过的git差异(diffs)与Conventional Commits消息配对的数据集,专为微调小型语言模型(如Gemma 3 270M、Qwen 2.5 Coder 0.5B/1.5B)设计,以根据代码差异自动生成提交消息。数据集采用聊天格式,包含系统提示、用户输入(差异内容)和助手响应(提交消息)。数据经过严格筛选,仅包括符合Conventional Commits规范(type(scope): subject格式)、主题长度在20-100字符之间、差异少于500行和8K字符、更改文件数为1-20个的提交,并排除了chore、revert、合并提交、版本更新、锁文件更改以及机器人添加的页脚等。原始约37,000个提交中,最终保留了5,043个示例,并重新平衡了类型分布,以确保模型对各种提交类型(如docs、refactor、ci、feat、test、fix、build各666个,perf 256个,style 125个)有均衡的暴露。数据集分为训练集(4,791个示例)和验证集(252个示例),建议在其他仓库(如Vue或Vite)的数据集上进行泛化测试。数据集基于MIT许可证发布。

Angular Commit Messages is a cleaned dataset of git diffs paired with Conventional Commits messages, scraped from the Angular repository. It is built for fine-tuning small language models (e.g., Gemma 3 270M, Qwen 2.5 Coder 0.5B/1.5B) to generate commit messages from diffs. The dataset is formatted in a chat style expected by SFTTrainer, including roles (system, user, assistant) and content. It has been filtered to include only commits that conform to the Conventional Commits specification (type(scope): subject), have subject lengths of 20–100 characters, diffs under 500 lines and 8K characters, and 1–20 files changed, while dropping chore, revert, merge commits, version bumps, lock-file changes, bot-added footers, and trailing PR references. Out of approximately 37K original commits, 5,043 survived all filters and rebalancing. The type distribution is rebalanced to provide balanced exposure across types (docs, refactor, ci, feat, test, fix, build with 666 each, perf with 256, style with 125) rather than reflecting Angulars natural skew. The dataset is split into train (4,791 examples) and validation (252 examples) sets, and it is recommended to evaluate generalization on a separate dataset from a different repository (e.g., Vue or Vite). The dataset is licensed under MIT.
提供机构:
honesdev
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作