nyuuzyou/gitflic-code

Name: nyuuzyou/gitflic-code
Creator: nyuuzyou
Published: 2024-07-09 13:45:37
License: No description available

Hugging Face, updated 2024-07-09, indexed 2024-07-22

Download link:

https://hf-mirror.com/datasets/nyuuzyou/gitflic-code

Resource summary:

This dataset was compiled from code repositories hosted on GitFlic, the first Russian service for storing and working with source code, built on the Git version control system. It contains code from 12,527 repositories spanning 692 file types. After deduplication and removal of binary files, 60 GB of unique code remains. Each entry represents a single file and has three fields: file_text (the file content), language (the identified programming language), and file_name (a unique file name). All examples are in the train split; there is no validation split. The dataset curators considered ethical issues in how the data is collected and used.
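To make the three-field layout concrete, here is a minimal sketch of reading records and tallying the `language` field. It assumes the Hugging Face `datasets` library and the repository id from the page title; the helper names `stream_gitflic_code` and `count_languages` and the sample records are hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator


def stream_gitflic_code() -> Iterator[Dict[str, str]]:
    """Stream records from the train split (needs `datasets` and network access)."""
    from datasets import load_dataset  # third-party; imported lazily on purpose

    # Assumption: the repository id matches the page title. Streaming avoids
    # downloading the full ~60 GB before reading the first record.
    return iter(load_dataset("nyuuzyou/gitflic-code", split="train", streaming=True))


def count_languages(records: Iterable[Dict[str, str]], limit: int = 1000) -> Counter:
    """Count the `language` field over at most the first `limit` records."""
    return Counter(rec["language"] for rec in islice(records, limit))


# Hypothetical records with the three documented fields, for offline illustration.
sample = [
    {"file_text": "print('hi')\n", "language": "Python", "file_name": "a.py"},
    {"file_text": "fn main() {}\n", "language": "Rust", "file_name": "b.rs"},
    {"file_text": "import os\n", "language": "Python", "file_name": "c.py"},
]
language_counts = count_languages(sample)
print(language_counts.most_common())
```

With network access, `count_languages(stream_gitflic_code())` would run the same tally over the first 1,000 real records; the `limit` argument keeps the pass short, since the only split is the full train split.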

Provider:

nyuuzyou

Summary of original information

GitFlic Code Dataset

Dataset overview

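This dataset was compiled from code repositories hosted on the GitFlic platform. GitFlic is the first Russian service for storing and working with source code, based on the Git version control system.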

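The dataset contains code from 12,527 repositories spanning 692 file types, as identified by github-linguist. After deduplication and removal of binary files, it holds 60 GB of unique code, drawn from more than 967 GB of analyzed data. It comprises 5,983,358 unique code files; each entry represents a single file, including its content, identified language, and unique file name.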
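The page does not say how deduplication or binary filtering was performed. As a rough sketch of what such a pass could look like, the code below assumes exact-content deduplication by SHA-256 digest and a NUL-byte heuristic for detecting binary files. Both choices, and the names `looks_binary` and `unique_text_files`, are illustrative assumptions and not the curator's documented method. Language identification (done with github-linguist in the original) is left out.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple


def looks_binary(data: bytes, probe: int = 8000) -> bool:
    """Heuristic (assumption): a NUL byte in the first `probe` bytes means binary."""
    return b"\x00" in data[:probe]


def unique_text_files(files: Iterable[Tuple[str, bytes]]) -> Iterator[Dict[str, str]]:
    """Yield one record per distinct non-binary file content, keeping first occurrences."""
    seen = set()  # SHA-256 digests of contents already emitted
    for name, data in files:
        if looks_binary(data):
            continue  # drop binary files
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        yield {
            "file_text": data.decode("utf-8", errors="replace"),
            "file_name": name,
        }


# Hypothetical input: (file name, raw bytes) pairs.
raw_files = [
    ("a.py", b"x = 1\n"),
    ("copy_of_a.py", b"x = 1\n"),      # exact duplicate of a.py
    ("image.bin", b"\x89PNG\x00\x00"),  # contains NUL bytes
    ("b.py", b"y = 2\n"),
]
kept = list(unique_text_files(raw_files))
print([rec["file_name"] for rec in kept])
```

Hashing whole files only catches byte-identical copies; near-duplicate detection would need a different technique, and nothing on the page indicates which one was used.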