452,000,000 public Git commits on GitHub (October 2016)
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/records/285467
Description:
What's inside
part-000xx.lzo - LZO archives with the data (refer to "Format").
part-000xx.lzo.index - LZO index files so that the archives are splittable in Hadoop.
stats.csv.gz - GZIP-ed CSV file with some repository statistics related to the commits.
Format
part-000xx - text, one line per repository; every line is a JSON object with the following schema:
{ "r": "repository name", "c": [{ "h": "git hash", "a": "author's email hash", "t": "date and time commit was created", "m": "commit message" }, ...] }
The date and time format is mostly that of Go's time.Time.String(); I recommend using dateutil.parser.parse() to parse it in Python.
Commit messages contain explicit \r and \n symbols so that each record stays on a single line.
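As a rough illustration of the record layout described above, the following sketch turns one decompressed line into per-commit dictionaries. The function name iter_commits, the output key names, and the sample line are invented for this example and are not part of the dataset. Timestamps are left as raw strings unless a parser is passed in, and LZO decompression is assumed to have happened already.

```python
import json
from typing import Any, Callable, Dict, Iterator, Optional


def iter_commits(
    line: str,
    parse_time: Optional[Callable[[str], Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per commit from a single JSON line of a part file.

    `parse_time` is an optional hook for timestamp parsing; when it is
    omitted, the "t" field is returned unchanged as a string.
    """
    record = json.loads(line)
    repo = record["r"]
    for commit in record["c"]:
        raw_time = commit["t"]
        yield {
            "repo": repo,
            "hash": commit["h"],
            "author": commit["a"],  # hash of the author's email
            "time": parse_time(raw_time) if parse_time else raw_time,
            "message": commit["m"],
        }


# Invented sample line that follows the documented schema.
sample = (
    '{"r": "example/repo", "c": [{"h": "0123abcd", "a": "deadbeef", '
    '"t": "2016-10-05 14:23:11 +0000 UTC", "m": "Fix typo\\nSecond line"}]}'
)
commits = list(iter_commits(sample))
```

To get datetime objects, one option is to pass parse_time=dateutil.parser.parse, assuming the python-dateutil package is installed. Whether the messages need any unescaping beyond what json.loads already does is best checked against a few real records.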
stats.csv has 4 columns: repository name, number of commits, number of contributors, average length of the commit messages.
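The four columns can be read with the Python standard library alone. The sketch below makes several assumptions the description does not confirm: that the file has no header row (toggle has_header otherwise), that the two counts are integers, and that the average length is a decimal number. RepoStats, read_stats, and read_stats_file are hypothetical names introduced for this example.

```python
import csv
import gzip
import io
from typing import Iterable, Iterator, NamedTuple


class RepoStats(NamedTuple):
    repo: str
    commits: int
    contributors: int
    avg_message_length: float


def read_stats(lines: Iterable[str], has_header: bool = False) -> Iterator[RepoStats]:
    """Parse CSV text rows into RepoStats tuples, skipping malformed rows."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # drop the header row if one is present
    for row in reader:
        if len(row) != 4:
            continue  # not the documented four-column shape
        yield RepoStats(row[0], int(row[1]), int(row[2]), float(row[3]))


def read_stats_file(path: str, has_header: bool = False) -> Iterator[RepoStats]:
    """Stream rows from the gzip-compressed stats file at `path`."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from read_stats(handle, has_header)


# Invented rows, used only to show the expected shape.
demo = io.StringIO("example/repo,10,2,35.5\nother/repo,3,1,12.0\n")
rows = list(read_stats(demo))
```

For the real file, list(read_stats_file("stats.csv.gz")) would load everything into memory; iterating over the generator instead keeps memory use flat.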
Created:
2020-01-24



