LaughingLogits/Stackless_Java_V2

Name: LaughingLogits/Stackless_Java_V2
Creator: LaughingLogits
Published: 2024-08-16 11:32:07
License: not specified

Source: Hugging Face. Updated: 2024-08-16. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/LaughingLogits/Stackless_Java_V2

Description:

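The dataset is built by scraping public repositories on GitHub and contains individual Java files. The creation process has three stages: collection, cleaning, and deduplication. The collection stage scrapes 10,500 public repositories through the GitHub API, ensuring that at least 10,000 repositories are collected. The cleaning stage excludes Java files larger than 50MB and files with fewer than 10 words, and removes auto-generated files. The deduplication stage performs exact and near-duplicate deduplication, ensuring that the dataset has no overlap with Java-Stack v2. Each record includes the file name, file path, file content, file size, language, extension, repository name, repository star count, repository fork count, repository open-issue count, repository creation time, repository push time, the SHA of the file content, and the index of files that are near-duplicates of Java-Stack v2 files.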
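
The size and word-count filters of the cleaning stage can be illustrated with a minimal sketch. The function name `keep_file`, the use of binary megabytes, and the whitespace-based word count are assumptions made for illustration; the description does not say how the dataset authors implemented these checks, and the removal of auto-generated files is not covered here.

```python
MAX_BYTES = 50 * 1024 * 1024  # 50MB limit from the description (binary MB is an assumption)
MIN_WORDS = 10                # minimum word count from the description


def keep_file(content: str) -> bool:
    """Return True if a Java file passes the size and word-count filters."""
    size_in_bytes = len(content.encode("utf-8"))
    if size_in_bytes > MAX_BYTES:
        return False
    # Whitespace-separated tokens stand in for "words" (assumption).
    if len(content.split()) < MIN_WORDS:
        return False
    return True


# Two hypothetical files: one too short, one long enough to keep.
files = {
    "Tiny.java": "class Tiny {}",
    "Hello.java": (
        "public class Hello { public static void main(String[] args) "
        "{ System.out.println(1); } }"
    ),
}
kept = [name for name, text in files.items() if keep_file(text)]
print(kept)
```

In this sketch `Tiny.java` has only three whitespace-separated tokens and is dropped, while `Hello.java` passes both checks.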

This is a new Java dataset created by scraping public repositories on GitHub. It includes individual Java files rather than entire projects or code snippets. The dataset creation process involves three key stages: collection, cleaning, and deduplication. The collection stage scrapes 10,500 public repositories using the GitHub API, focusing on repositories under strong copyleft licenses such as GPL-2.0, GPL-3.0, or AGPL-3.0. The cleaning stage excludes files larger than 50MB and those with fewer than 10 words, and removes auto-generated files. The deduplication stage performs exact deduplication using the SHA-256 hash function and near-deduplication using the MinHashLSH algorithm. The dataset includes two configurations, Raw_Java and Stackless_Java_V2, each with train and test splits.
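
The two deduplication steps can be sketched with the Python standard library alone. This is a simplified stand-in, not the authors' pipeline: exact deduplication keeps the first file for each SHA-256 digest, and near-duplicate detection is approximated by comparing MinHash signatures directly, without the LSH banding that a MinHashLSH index adds. The shingle size `k=3`, the signature length `num_perm=64`, and all function names are assumptions chosen for illustration.

```python
import hashlib


def sha256_of(content: str) -> str:
    """Hex SHA-256 digest of a file's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def exact_dedup(contents):
    """Keep the first occurrence of each distinct content, keyed by SHA-256."""
    seen = set()
    unique = []
    for text in contents:
        digest = sha256_of(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def shingles(text, k=3):
    """Set of k-token shingles; k=3 is an illustrative choice."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum salted hash per simulated permutation."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(salt + s.encode("utf-8")).digest()[:8], "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

A pair of files whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. The threshold used for this dataset is not stated in the description, so none is assumed here.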

Provided by:

LaughingLogits
