jblitzar/github-python
Source: Hugging Face. Updated 2025-07-30. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/jblitzar/github-python
Description:
The GitHub-Python dataset is a corpus of Python code extracted from GitHub. It contains two complementary subsets: a subset restricted to files under permissive licenses, suitable for commercial redistribution or model training, and a broader crawl that includes files under a mix of licenses, suitable for analysis or pre-training where license mixing is acceptable. The data has been deduplicated, formatted, and cleaned, and API keys and credentials have been removed for security and compliance. The dataset uses a custom tokenization scheme and ships with the corresponding vocabulary file. Collection follows a defined methodology: repository discovery, file filtering, license compliance checking, deduplication, formatting and cleaning, and secret redaction.
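The deduplication and secret-removal steps named in the description can be illustrated with a minimal sketch. This is not the dataset's actual pipeline, which is not documented here: exact-match hashing, the regex patterns, the `<REDACTED>` placeholder, and the function names `dedupe` and `redact_secrets` are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical placeholder; the dataset's real replacement token is unknown.
PLACEHOLDER = "<REDACTED>"

# Assumed patterns, for illustration only.
# Strings shaped like an AWS access key ID: "AKIA" plus 16 uppercase letters or digits.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
# Quoted values assigned to names containing api_key, secret, token, or password.
ASSIGNMENT = re.compile(
    r"""\b(\w*(?:api_?key|secret|token|password)\w*\s*=\s*)(['"])[^'"\n]{8,}\2""",
    re.IGNORECASE,
)


def redact_secrets(source: str) -> str:
    """Replace values that look like credentials with a placeholder."""
    source = AWS_KEY.sub(PLACEHOLDER, source)
    # Keep the variable name and the quotes, swap only the quoted value.
    return ASSIGNMENT.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{PLACEHOLDER}{m.group(2)}", source
    )


def dedupe(files: list[str]) -> list[str]:
    """Drop exact duplicates, ignoring leading and trailing whitespace."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in files:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)  # the first occurrence is kept unchanged
    return unique
```

Exact hashing only catches byte-identical files after whitespace trimming. A production pipeline might use near-duplicate detection instead, and real secret scanners cover far more credential formats than these two patterns.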
Provider:
jblitzar



