codeparrot/codeparrot-clean

Name: codeparrot/codeparrot-clean
Creator: codeparrot
Published: 2022-10-10 15:23:51
License: No description available

Hugging Face · Updated 2022-10-10 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/codeparrot/codeparrot-clean

Resource description:

---
tags:
- python
- code
---

# CodeParrot 🦜 Dataset Cleaned

## What is it?

A dataset of Python files from Github. This is the deduplicated version of the [codeparrot](https://huggingface.co/datasets/transformersbook/codeparrot).

## Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

- Deduplication
    - Remove exact matches
- Filtering
    - Average line length < 100
    - Maximum line length < 1000
    - Alpha numeric characters fraction > 0.25
    - Remove auto-generated files (keyword search)

For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).

## Splits

The dataset is split in a [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train) and [validation](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid) split used for training and evaluation.

## Structure

This dataset has ~50GB of code and 5361373 files.

```python
DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
        num_rows: 5361373
    })
})
```
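The deduplication step ("remove exact matches") can be illustrated with a minimal sketch. It assumes each file is a dict with a `content` field, as in the feature list, and that exact matches are detected by hashing the raw content with MD5. The hash choice and the helper names `content_hash` and `drop_exact_duplicates` are illustrative assumptions, not taken from the preprocessing script.

```python
import hashlib


def content_hash(content: str) -> str:
    """Hash the raw file content (MD5 is an assumed choice for this sketch)."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def drop_exact_duplicates(files):
    """Keep the first file seen for each distinct content hash."""
    seen = set()
    unique = []
    for record in files:
        digest = content_hash(record["content"])
        if digest in seen:
            continue  # an identical file was already kept
        seen.add(digest)
        unique.append({**record, "hash": digest})
    return unique


# Hypothetical toy input: a.py and b.py have identical content.
files = [
    {"path": "a.py", "content": "x = 1\n"},
    {"path": "b.py", "content": "x = 1\n"},
    {"path": "c.py", "content": "y = 2\n"},
]
kept = drop_exact_duplicates(files)
```

Keeping the first occurrence is an arbitrary tie-break; any deterministic rule would serve the same purpose.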

Provider:

codeparrot

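Summary of original information

CodeParrot 🦜 Dataset Cleaned

Overview

CodeParrot 🦜 Dataset Cleaned is a deduplicated and cleaned dataset of Python source files collected from GitHub.

Data processing

Deduplication: files that are exact duplicates are removed.
Filtering:
- Average line length below 100.
- Maximum line length below 1000.
- Fraction of alphanumeric characters above 0.25.
- Auto-generated files are removed (detected by keyword search).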
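The four filters above can be sketched as a single predicate over a file's text. This is a rough illustration under stated assumptions: line lengths are counted in characters, the keyword list `AUTOGEN_KEYWORDS` and the choice to scan only the first five lines are hypothetical, and the function names are invented for this example. Only the thresholds come from the description.

```python
# Hypothetical keyword list; the actual keywords are not given in the source.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def file_stats(content: str) -> dict:
    """Compute the per-file statistics used by the filters."""
    lines = content.splitlines()
    lengths = [len(line) for line in lines] or [0]
    alnum = sum(ch.isalnum() for ch in content)
    head = "\n".join(lines[:5]).lower()  # assumption: scan only the first 5 lines
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "alpha_frac": alnum / len(content) if content else 0.0,
        "autogenerated": any(keyword in head for keyword in AUTOGEN_KEYWORDS),
    }


def keep_file(content: str) -> bool:
    """Return True if the file passes all four filters."""
    stats = file_stats(content)
    return (
        stats["line_mean"] < 100
        and stats["line_max"] < 1000
        and stats["alpha_frac"] > 0.25
        and not stats["autogenerated"]
    )
```

The computed statistics correspond to the `line_mean`, `line_max`, `alpha_frac`, and `autogenerated` columns listed among the dataset features.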

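Data splits

The dataset is divided into a training set and a validation set:

Training set: used for model training.
Validation set: used for model evaluation.

Data structure

Size: about 50 GB.
Number of files: 5,361,373.
Features: repository name, path, number of copies, size, content, license, hash, average line length, maximum line length, fraction of alphanumeric characters, and an auto-generated flag.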
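Because the dataset is about 50 GB, it can help to inspect a few records without downloading everything. The sketch below assumes the third-party Hugging Face `datasets` package is installed and that network access is available. `stream_train_split` and `preview` are hypothetical helper names, and the column names are taken from the feature list in the dataset card.

```python
from itertools import islice

# Metadata columns from the dataset card; `content` is left out to keep previews small.
METADATA_COLUMNS = (
    "repo_name", "path", "license", "size",
    "line_mean", "line_max", "alpha_frac", "autogenerated",
)


def preview(records, n=3):
    """Return the metadata fields of the first n records from any iterable of dicts."""
    return [
        {key: record[key] for key in METADATA_COLUMNS if key in record}
        for record in islice(records, n)
    ]


def stream_train_split():
    """Open the train split lazily (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported here so `preview` works without it

    return load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)
```

With network access, `preview(stream_train_split())` would return the metadata of the first three files; `preview` also works on any local list of records with the same keys.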

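Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is a cleaned collection of Python code from GitHub, containing about 5.36 million files with a total size of about 50 GB. It was refined through deduplication and filtering steps (removing duplicates, limiting line length, and excluding auto-generated files) and is suited to training and evaluating code generation models.

The content above was collected and summarized by 遇见数据集.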
