
github_archive

ModelScope community · Updated 2025-08-08 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/common-pile/github_archive
Resource description:
# GitHub Archive

## Description

According to [GitHub’s terms of service](https://docs.github.com/en/site-policy/github-terms/github-terms-of-service), issues and pull request descriptions, along with their comments, inherit the license of their associated repository.

To collect this data, we used the [GitHub Archive’s](https://www.gharchive.org/) public BigQuery table of events to extract all issue, pull request, and comment events since 2011 and aggregated them into threads. The table appeared to be missing “edit” events, so the text of each comment is the original from when it was first posted. We filtered out comments from bots. This resulted in approximately 177 million threads across 19 million repositories.

We then removed threads whose repositories did not have a Blue Oak Council-approved license. License information for each repository comes from one of three sources: 1) the “public-data:github_repos” BigQuery table, 2) metadata from the StackV2, or 3) the GitHub API. License filtering left 10 million repositories.

PyMarkdown was used to convert from GitHub-flavored Markdown to plain text. When parsing failed, the raw Markdown was kept.

Per-document license information is available in the `license` entry of the `metadata` field of each example. Code for collecting, processing, and preparing this dataset is available in the [common-pile GitHub repo](https://github.com/r-three/common-pile).

## Dataset Statistics

| Documents | UTF-8 GB |
|-----------|----------|
| 30,318,774 | 54.7 |

## License Issues

While we aim to produce datasets with completely accurate licensing information, license laundering and inaccurate metadata can cause us to assign an incorrect license to some documents (for further discussion of this limitation, please see [our paper](https://huggingface.co/papers/2506.05209)). If you believe you have found an instance of incorrect licensing in this dataset, please [start a discussion](https://github.com/r-three/common-pile/discussions/new) on this repository.

## Other Versions

This is the "raw" version of the GitHub Archive dataset. If you are looking for the filtered version used to train [Comma v0.1](https://huggingface.co/common-pile/comma-v0.1), you can find it [here](https://huggingface.co/datasets/common-pile/github_archive_filtered).

## Citation

If you use this dataset, please cite:

```bibtex
@article{kandpal2025common,
  title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
  author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
  journal={arXiv preprint},
  year={2025}
}
```
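The Description notes that each example carries its license in the `license` entry of the `metadata` field. The following is a minimal sketch of how that entry might be read and tallied. It is not the authors' code. The `text` field name, the sample records, and the possibility that `metadata` arrives as a JSON-encoded string rather than a dict are all assumptions made for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, Optional


def get_license(example: Dict[str, Any]) -> Optional[str]:
    """Return the `license` entry of an example's `metadata` field, if any."""
    metadata = example.get("metadata")
    # Assumption: metadata may be a dict or a JSON-encoded string.
    if isinstance(metadata, str):
        try:
            metadata = json.loads(metadata)
        except json.JSONDecodeError:
            return None
    if not isinstance(metadata, dict):
        return None
    return metadata.get("license")


def count_licenses(examples: Iterable[Dict[str, Any]]) -> Counter:
    """Tally examples per license; examples without one count as 'unknown'."""
    return Counter(get_license(ex) or "unknown" for ex in examples)


# Hypothetical records for illustration only; not real rows from the dataset.
sample = [
    {"text": "Thread A", "metadata": {"license": "MIT"}},
    {"text": "Thread B", "metadata": '{"license": "Apache-2.0"}'},
    {"text": "Thread C", "metadata": {"license": "MIT"}},
    {"text": "Thread D", "metadata": {}},
]

license_counts = count_licenses(sample)
print(license_counts.most_common())
```

The same `count_licenses` function accepts any iterable of dict-like examples, so it could be applied to a streamed copy of the dataset without loading all 30 million documents into memory.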

Provider:
maas
Created:
2025-06-11