common-pile/github_archive
Source: Hugging Face | Updated: 2025-06-06 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/common-pile/github_archive
Description:
The GitHub Archive dataset consists of event data extracted from GitHub Archive's public BigQuery table. It covers all issues, pull requests, and their comments since 2011, aggregated into threads. Comments from bots were removed, and only threads from repositories with a Blue Oak Council-approved license were kept. PyMarkdown was used to convert GitHub-flavored Markdown to plain text. The licensing information for each document can be found in the `license` entry of the `metadata` field of each example. This is the raw, unfiltered version of the dataset; the code for collecting, processing, and preparing it can be found in the common-pile GitHub repository.
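As an illustration of where the per-document license lives, the following minimal sketch reads the `license` entry from each example's `metadata` field and tallies the values. The helper names (`license_of`, `count_licenses`, `stream_examples`), the `text` key, the sample records, and the `train` split name are assumptions made for this example, not details confirmed by the dataset card. It also assumes `metadata` is a dictionary-like object.

```python
from collections import Counter
from itertools import islice
from typing import Any, Iterable, Mapping, Optional


def license_of(example: Mapping[str, Any]) -> Optional[str]:
    """Return the `license` entry of an example's `metadata` field, or None."""
    metadata = example.get("metadata")
    if isinstance(metadata, Mapping):
        return metadata.get("license")
    return None


def count_licenses(examples: Iterable[Mapping[str, Any]]) -> Counter:
    """Tally how many examples carry each license value."""
    return Counter(license_of(example) for example in examples)


def stream_examples(limit: int = 100):
    """Stream the first `limit` examples from the Hugging Face Hub.

    Assumptions: the `datasets` package is installed, network access is
    available, and the split is named "train" (not confirmed by the source).
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    dataset = load_dataset(
        "common-pile/github_archive", split="train", streaming=True
    )
    return islice(dataset, limit)


# Hypothetical records that mimic the documented layout; not real data.
sample = [
    {"text": "thread one", "metadata": {"license": "MIT"}},
    {"text": "thread two", "metadata": {"license": "Apache-2.0"}},
    {"text": "thread three", "metadata": {"license": "MIT"}},
    {"text": "thread four", "metadata": {}},
]

license_counts = count_licenses(sample)
print(license_counts)
```

To run the same tally against the hosted data, `count_licenses(stream_examples(1000))` should work under the assumptions above. Streaming avoids downloading the full dataset before inspecting it.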
Provider:
common-pile



