nyuuzyou/moshub-code

Name: nyuuzyou/moshub-code
Creator: nyuuzyou
Published: 2024-07-10 16:49:25
License: No description provided

Source: Hugging Face. Updated 2024-07-10; indexed 2024-07-22.

Download link:

https://hf-mirror.com/datasets/nyuuzyou/moshub-code

Resource summary:

This dataset was compiled from code repositories hosted on the Mos.Hub platform. It covers 16,130 repositories and 304 different file types. After deduplication and filtering, it contains 32 GB of unique code, drawn from over 794 GB of analyzed data, in 15,740,822 unique code files. Each file record includes its content, its identified language, and a unique filename. The dataset spans multiple programming languages, with each file's language identified by github-linguist. It provides a training split only, with no validation split. Ethical considerations were taken into account during collection, with the aim of ensuring lawful use of the data.

Provided by:

nyuuzyou

Original information summary

Mos.Hub Code Dataset

Dataset overview

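This dataset was compiled from code repositories hosted on the hub.mos.ru platform (Mos.Hub). Mos.Hub is a service for storing and working with source code, based on the Git version control system.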

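The dataset contains code from 16,130 repositories spanning 304 different file types, as identified by github-linguist. After deduplication and removal of binary files, it contains 32 GB of unique code, extracted from over 794 GB of analyzed data. The dataset contains 15,740,822 unique code files. Each entry represents one file and includes its content, its identified language, and a unique filename.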
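
To make the record layout and the idea of deduplication concrete, here is a minimal Python sketch. It is an illustration only: the field names `file_name`, `language`, and `content` are assumed from the description above, and exact-match hashing is an assumption, since the source does not say which deduplication method was actually used.

```python
import hashlib
from collections import Counter


def dedupe_records(records):
    """Keep the first record for each distinct `content` value (exact match).

    Assumption: exact-content hashing. The dataset description does not
    state how its deduplication was performed.
    """
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["content"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content already kept under another filename
        seen.add(digest)
        unique.append(rec)
    return unique


def language_counts(records):
    """Count records per identified language."""
    return Counter(rec["language"] for rec in records)


# Hypothetical sample records; field names are assumed, not confirmed.
sample = [
    {"file_name": "a.py", "language": "Python", "content": "print('hi')\n"},
    {"file_name": "b.py", "language": "Python", "content": "print('hi')\n"},
    {"file_name": "c.go", "language": "Go", "content": "package main\n"},
]

unique = dedupe_records(sample)
print(len(unique))
print(language_counts(unique))
```

In this sample, `b.py` is dropped because its content is byte-for-byte identical to `a.py`. A near-duplicate filter would need a different technique and is not shown here.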