WizzF/Heap-Forge

Source: Hugging Face. Updated: 2024-12-28. Indexed: 2024-12-21.
Download link:
https://hf-mirror.com/datasets/WizzF/Heap-Forge
Description:
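The Heap is a multilingual code dataset containing code files in 57 programming languages, intended to provide contamination-free data for evaluating large language models (LLMs). It was collected from up to 50,000 public repositories via the GitHub API, then cleaned and deduplicated. Each file carries a set of features: file name, path, content, size, language, extension, line count, average line length, maximum line length, alphanumeric ratio, repository name, star count, fork count, open issue count, license type, extraction date, and others. The dataset also includes exact and near deduplication results against several public code datasets (The Stack V2, The Stack, Red Pajama, GitHub Code, and CodeParrot).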
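
The per-file statistics named in the description (size, line count, average and maximum line length, alphanumeric ratio) could be derived from a file's content roughly as follows. This is a minimal sketch, not the dataset authors' code: the function name `file_stats`, the dictionary keys, and the exact definitions (for example, measuring size in UTF-8 bytes and computing the ratio over all characters) are assumptions made for illustration.

```python
def file_stats(content: str) -> dict:
    """Compute simple per-file statistics from raw file content.

    All key names and definitions here are illustrative assumptions,
    not the dataset's documented schema.
    """
    lines = content.splitlines()
    line_lengths = [len(line) for line in lines]
    total_chars = len(content)
    # Count characters that are letters or digits.
    alnum_chars = sum(ch.isalnum() for ch in content)

    return {
        # Assumption: size is measured in UTF-8 encoded bytes.
        "size": len(content.encode("utf-8")),
        "total_lines": len(lines),
        "avg_line_length": sum(line_lengths) / len(lines) if lines else 0.0,
        "max_line_length": max(line_lengths, default=0),
        # Assumption: ratio is taken over all characters, newlines included.
        "alphanum_fraction": alnum_chars / total_chars if total_chars else 0.0,
    }


stats = file_stats("ab\ncdef\n")
print(stats["total_lines"], stats["max_line_length"])
```

Empty files are handled explicitly so that the averages never divide by zero.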

The Heap is a multilingual code dataset covering 57 programming languages, designed to make LLM evaluation reproducible. It reduces contamination by prioritizing repositories with non-permissive licenses, and it collects up to 50,000 public GitHub repositories, filtered by license type, star count, and creation date. The cleaning process excludes files larger than 10 MB and files with fewer than 10 words. Deduplication uses both exact and near-duplicate detection against several open code datasets. The final dataset includes fields such as file name, path, content, size, language, and repository details, plus flags indicating whether each file is an exact or near duplicate of a file in those other datasets.
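
The cleaning and deduplication steps described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual pipeline. The description gives only the 10 MB and 10-word thresholds. Everything else here is hypothetical: the function names, SHA-256 over raw content for exact matching, Jaccard similarity over 5-token shingles for near matching, and the 0.7 similarity threshold.

```python
import hashlib
import re

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MIN_WORDS = 10                # from the description: drop files with fewer than 10 words


def keep_file(content: str) -> bool:
    """Cleaning filter: drop files over 10 MB or with fewer than 10 words."""
    size = len(content.encode("utf-8"))
    return size <= MAX_BYTES and len(content.split()) >= MIN_WORDS


def exact_key(content: str) -> str:
    """Assumption: exact duplicates are detected by hashing raw content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def shingles(content: str, k: int = 5) -> set:
    """Set of k-token windows; k=5 is an illustrative choice."""
    tokens = re.findall(r"\w+", content)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_duplicates(content: str, reference_files: list, threshold: float = 0.7) -> dict:
    """Flag a file as an exact or near duplicate of any reference file."""
    reference_keys = {exact_key(ref) for ref in reference_files}
    own_shingles = shingles(content)
    return {
        "exact_duplicate": exact_key(content) in reference_keys,
        "near_duplicate": any(
            jaccard(own_shingles, shingles(ref)) >= threshold
            for ref in reference_files
        ),
    }


original = "def add(a, b):\n    return a + b  # sum two numbers and return the result value\n"
edited = original.replace("result value", "result values")
print(keep_file(original), flag_duplicates(edited, [original]))
```

Comparing every file against every reference file, as this sketch does, is quadratic and only practical for small inputs. At the scale of the datasets named in the description, an approximate scheme such as MinHash with locality-sensitive hashing is the usual substitute.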
Provided by:
WizzF