codeparrot/codeparrot-train-more-filtering

Name: codeparrot/codeparrot-train-more-filtering
Creator: codeparrot
Published: 2022-06-21 17:54:51
License: no description available

Hugging Face · Updated 2022-06-21 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/codeparrot/codeparrot-train-more-filtering

Resource overview:

# CodeParrot 🦜 Dataset Cleaned and filtered (train)

## Dataset Description

A dataset of Python files from GitHub. It is a more heavily filtered version of the train split [codeparrot-clean-train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train) of [codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean#codeparrot-%F0%9F%A6%9C-dataset-cleaned). The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:

- files with a mention of "test file" or "configuration file" or similar in the first 5 lines
- files with a high occurrence of the keywords "test " or "config"
- files without a mention of the keywords `def`, `for`, `while` and `class`
- files that use the assignment operator `=` fewer than 5 times
- files with a ratio between the number of characters and the number of tokens after tokenization < 1.5
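The five filters listed above can be sketched in plain Python. This is only an illustrative approximation, not the dataset authors' actual preprocessing code: the function names, the keyword-rate threshold of 0.05, the extra "config file" phrase, the regex used to count assignments, and the whitespace tokenizer standing in for a real tokenizer are all assumptions made for the example.

```python
import random
import re

# NOTE: every name and threshold below is hypothetical, chosen for illustration.

def mentions_test_or_config(text, n_lines=5):
    """Filter 1: 'test file' / 'configuration file' (or similar) in the first lines."""
    head = "\n".join(text.splitlines()[:n_lines]).lower()
    return any(k in head for k in ("test file", "configuration file", "config file"))

def high_keyword_rate(text, threshold=0.05):
    """Filter 2: 'test ' or 'config' occurs often relative to the line count.
    The threshold is an assumed value; the source only says 'high occurrence'."""
    n_lines = max(len(text.splitlines()), 1)
    hits = text.count("test ") + text.count("config")
    return hits / n_lines > threshold

def lacks_code_keywords(text):
    """Filter 3: none of def, for, while, class appears as a whole word."""
    return re.search(r"\b(def|for|while|class)\b", text) is None

def count_assignments(text):
    """Count '=' signs that are not part of ==, !=, <= or >= (a rough proxy;
    keyword arguments and augmented assignments are counted too)."""
    return len(re.findall(r"(?<![=!<>])=(?!=)", text))

def few_assignments(text, minimum=5):
    """Filter 4: the assignment operator is used fewer than `minimum` times."""
    return count_assignments(text) < minimum

def low_char_token_ratio(text, tokenize=str.split, min_ratio=1.5):
    """Filter 5: characters per token below `min_ratio`.
    `tokenize` defaults to whitespace splitting as a stand-in tokenizer."""
    tokens = tokenize(text)
    if not tokens:
        return True
    return len(text) / len(tokens) < min_ratio

def should_drop(text, rng, tokenize=str.split, p=0.7):
    """Apply the first three filters with probability `p`, the last two always."""
    probabilistic_hit = (
        mentions_test_or_config(text)
        or high_keyword_rate(text)
        or lacks_code_keywords(text)
    )
    if probabilistic_hit and rng.random() < p:
        return True
    return few_assignments(text) or low_char_token_ratio(text, tokenize)
```

A caller would pass a seeded `random.Random` instance to `should_drop` so that the probabilistic filters are reproducible, and would swap in the tokenizer of the model being trained for the `tokenize` argument.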

Provided by:

codeparrot

Summary of original information

CodeParrot 🦜 Dataset Cleaned and filtered (train)

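Dataset description

This dataset consists of Python files collected from GitHub and is a further-filtered version of codeparrot-clean-train. The filtering aims to identify configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7. The filtering criteria are:

Files that mention "test file", "configuration file", or similar in the first 5 lines.
Files with a high occurrence of the keywords "test " or "config".
Files that do not mention the keywords def, for, while and class.
Files that use the assignment operator = fewer than 5 times.
Files whose ratio of character count to token count after tokenization is below 1.5.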
