
LaughingLogits/Stackless_Java_V2

Source: Hugging Face. Updated 2024-08-16. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/LaughingLogits/Stackless_Java_V2
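Resource overview:
This dataset was built by scraping public repositories on GitHub and consists of individual Java files. Its creation has three stages: collection, cleaning, and deduplication. The collection stage scrapes 10,500 public repositories through the GitHub API and ensures that at least 10,000 repositories are collected. The cleaning stage excludes Java files larger than 50MB and files with fewer than 10 words, and removes auto-generated files. The deduplication stage performs exact and near-duplicate deduplication so that the dataset has no overlap with Java-Stack v2. Each record contains the file name, file path, file content, file size, language, extension, repository name, repository star count, repository fork count, repository open-issue count, repository creation time, repository push time, the SHA of the file content, and the indices of near-duplicate files in Java-Stack v2.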
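
The cleaning rules described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code: the `JavaFile` field names, the reading of "50MB" as 50 MiB, and the marker strings used to guess whether a file is auto-generated are all assumptions, since the listing does not specify them.

```python
from dataclasses import dataclass

MAX_BYTES = 50 * 1024 * 1024  # "larger than 50MB", read here as 50 MiB (assumption)
MIN_WORDS = 10                # files with fewer than 10 words are dropped

# Hypothetical markers: the listing does not say how auto-generated files are detected.
AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "generated by", "do not edit")


@dataclass
class JavaFile:
    # Illustrative field names; the published column names may differ.
    file_name: str
    file_path: str
    content: str
    repo_name: str

    @property
    def file_size(self) -> int:
        """Size of the file content in bytes when encoded as UTF-8."""
        return len(self.content.encode("utf-8"))


def looks_auto_generated(content: str, header_lines: int = 5) -> bool:
    """Heuristic: look for a generator notice in the first few lines."""
    header = "\n".join(content.splitlines()[:header_lines]).lower()
    return any(marker in header for marker in AUTOGEN_MARKERS)


def keep_file(f: JavaFile) -> bool:
    """Apply the three cleaning rules: size cap, minimum word count, no generated code."""
    if f.file_size > MAX_BYTES:
        return False
    if len(f.content.split()) < MIN_WORDS:
        return False
    if looks_auto_generated(f.content):
        return False
    return True
```

Words are counted by splitting on whitespace, which is one plausible reading of "fewer than 10 words"; a tokenizer-based count would give different results near the threshold.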

This dataset is a new Java corpus built by scraping public repositories on GitHub. It contains individual Java files rather than entire projects or code snippets. The creation process has three key stages: collection, cleaning, and deduplication. The collection stage scrapes 10,500 public repositories using the GitHub API, targeting repositories under strong copyleft licenses such as GPL-2.0, GPL-3.0, or AGPL-3.0. The cleaning stage excludes files larger than 50MB and files with fewer than 10 words, and removes auto-generated files. The deduplication stage performs exact deduplication using SHA-256 hashes and near-deduplication using the MinHashLSH algorithm. The dataset includes two configurations, Raw_Java and Stackless_Java_V2, each with train and test splits.
Provided by:
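
The two deduplication steps can be illustrated with a small standard-library sketch. Exact deduplication hashes each file with SHA-256 and keeps the first occurrence. The near-duplicate part is a simplified stand-in for MinHash with LSH banding, not the MinHashLSH implementation the authors used: the signature length, band count, shingle size, and tokenization below are assumptions chosen only to keep the example short.

```python
import hashlib
import re

NUM_PERM = 64  # MinHash signature length (assumption)
BANDS = 16     # LSH bands; 64 / 16 = 4 rows per band (assumption)
SHINGLE = 3    # tokens per shingle (assumption)


def sha256_hex(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def exact_dedup(files: list[str]) -> list[str]:
    """Keep the first file for each distinct SHA-256 digest."""
    seen: set[str] = set()
    unique = []
    for content in files:
        digest = sha256_hex(content)
        if digest not in seen:
            seen.add(digest)
            unique.append(content)
    return unique


def shingles(content: str) -> set[str]:
    """Overlapping token n-grams; whitespace and punctuation are ignored."""
    tokens = re.findall(r"\w+", content)
    if len(tokens) < SHINGLE:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + SHINGLE]) for i in range(len(tokens) - SHINGLE + 1)}


def minhash(content: str) -> tuple[int, ...]:
    """One minimum per seeded hash function, taken over all shingles."""
    sh = shingles(content)
    return tuple(
        min(
            int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in sh
        )
        for seed in range(NUM_PERM)
    )


def near_duplicate_pairs(files: list[str]) -> set[tuple[int, int]]:
    """Return index pairs whose signatures agree on at least one full LSH band."""
    rows = NUM_PERM // BANDS
    buckets: dict[tuple[int, tuple[int, ...]], list[int]] = {}
    for idx, content in enumerate(files):
        sig = minhash(content)
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets.setdefault(key, []).append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

Two files that differ only in formatting have different SHA-256 digests, so exact deduplication keeps both, while the banded MinHash step flags them as a candidate pair. In a real pipeline the candidate pairs would typically be checked against a similarity threshold before anything is dropped.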
LaughingLogits