Github Code Clean

Indexed 2025-09-30 (source: arXiv)
Download link:
https://huggingface.co/datasets/codeparrot/github-code-clean
Description:
This dataset consists of cleaned code collected from GitHub, filtered and curated to improve the quality and diversity of data for model training. To ensure data quality, it has been strictly deduplicated and filtered, removing low-quality code as well as content containing sensitive information or personally identifiable information (PII). It covers 116 programming languages and supports code-related tasks such as code generation, code repair, and code explanation.
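The description mentions deduplication and filtering but does not say how they were carried out. The following is a minimal sketch of what exact deduplication plus simple filtering can look like. It is not the dataset's actual pipeline. The function names (`normalize`, `clean_corpus`), the line-length threshold, and the email pattern used as a stand-in for PII detection are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical threshold and pattern, chosen for illustration only;
# the dataset's real filtering rules are not given in the description.
MAX_LINE_LENGTH = 1000
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(code: str) -> str:
    """Strip outer whitespace and trailing spaces so trivial variants hash equally."""
    return "\n".join(line.rstrip() for line in code.strip().splitlines())


def clean_corpus(files):
    """Yield normalized files that are non-empty, unique, and pass simple filters."""
    seen = set()
    for code in files:
        text = normalize(code)
        if not text:
            continue  # drop empty files
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates (after normalization)
        seen.add(digest)
        if max(len(line) for line in text.splitlines()) > MAX_LINE_LENGTH:
            continue  # drop files with very long lines (often minified or generated)
        if EMAIL_PATTERN.search(text):
            continue  # drop files containing an email address (crude PII proxy)
        yield text


samples = [
    "print('hi')\n",
    "print('hi')   \n",                      # duplicate after normalization
    "# contact: dev@example.com\ny = 2\n",   # contains an email address
    "x = 1\n",
    "",                                      # empty file
]
cleaned = list(clean_corpus(samples))
print(cleaned)
```

Hashing normalized text only catches exact duplicates. Near-duplicate detection and robust PII detection need heavier techniques and are outside the scope of this sketch.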
Provided by: GitHub