
github_archive_filtered

ModelScope community. Updated 2025-12-05. Indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/common-pile/github_archive_filtered
Resource overview:
# GitHub Archive

## Description

According to [GitHub’s terms of service](https://docs.github.com/en/site-policy/github-terms/github-terms-of-service), issues and pull request descriptions, along with their comments, inherit the license of their associated repository.

To collect this data, we used the [GitHub Archive’s](https://www.gharchive.org/) public BigQuery table of events to extract all issue, pull request, and comment events since 2011 and aggregated them into threads. The table appeared to be missing “edit” events, so the text from each comment is the original from when it was first posted. We filtered out comments from bots. This resulted in approximately 177 million threads across 19 million repositories.

We then removed threads whose repositories did not have a Blue Oak Council-approved license. License information for each repository comes from either 1) the “public-data:github_repos” BigQuery Table, 2) metadata from the StackV2, or 3) the GitHub API. License filtering left 10 million repositories.

PyMarkdown was used to convert from GitHub-flavored markdown to plain text. When parsing failed, the raw markdown was kept.

Per-document license information is available in the `license` entry of the `metadata` field of each example. Code for collecting, processing, and preparing this dataset is available in the [common-pile GitHub repo](https://github.com/r-three/common-pile).

## Dataset Statistics

| Documents  | UTF-8 GB |
|------------|----------|
| 23,358,580 | 40.4     |

## License Issues

While we aim to produce datasets with completely accurate licensing information, license laundering and inaccurate metadata can cause us to erroneously assign the incorrect license to some documents (for further discussion of this limitation, please see [our paper](https://huggingface.co/papers/2506.05209)). If you believe you have found an instance of incorrect licensing in this dataset, please [start a discussion](https://github.com/r-three/common-pile/discussions/new) on this repository.

## Other Versions

This is the "filtered" version of the GitHub Archive dataset. If you are looking for the raw version, you can find it [here](https://huggingface.co/datasets/common-pile/github_archive_raw).

## Citation

If you use this dataset, please cite:

```bibtex
@article{kandpal2025common,
  title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
  author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
  journal={arXiv preprint},
  year={2025}
}
```
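The description states that per-document license information sits in the `license` entry of each example's `metadata` field. As a minimal sketch of how that field might be tallied, the snippet below assumes each example behaves like a dictionary with a nested `metadata` mapping. The function name `license_counts`, the `"unknown"` fallback, and the sample records are hypothetical and only illustrate the access pattern; they are not taken from the dataset.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def license_counts(examples: Iterable[Mapping[str, Any]]) -> Counter:
    """Count documents per license by reading example["metadata"]["license"]."""
    counts: Counter = Counter()
    for example in examples:
        # Assumption: a missing or empty metadata field is treated as "unknown".
        metadata = example.get("metadata") or {}
        counts[metadata.get("license", "unknown")] += 1
    return counts


# Hypothetical records that mimic the documented layout; not real dataset rows.
sample = [
    {"text": "Fix typo in README", "metadata": {"license": "MIT"}},
    {"text": "Add CI workflow", "metadata": {"license": "Apache-2.0"}},
    {"text": "Bump version", "metadata": {"license": "MIT"}},
    {"text": "Record without metadata"},
]

counts = license_counts(sample)
print(counts.most_common())
```

Because the function accepts any iterable of mapping-like records, it should also work on rows streamed from a dataset loader, provided the rows expose `metadata` as a nested mapping as described above.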

Provider:
maas
Created:
2025-06-11