
codeparrot/codeparrot-train-more-filtering

Source: Hugging Face. Updated 2022-06-21. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/codeparrot/codeparrot-train-more-filtering
Description:
# CodeParrot 🦜 Dataset Cleaned and filtered (train)

## Dataset Description

A dataset of Python files from GitHub. It is a more heavily filtered version of the train split [codeparrot-clean-train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train) of [codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean#codeparrot-%F0%9F%A6%9C-dataset-cleaned). The additional filters aim to detect configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:

- files that mention "test file", "configuration file", or similar in the first 5 lines
- files with a high occurrence of the keywords "test " or "config"
- files that contain none of the keywords `def`, `for`, `while`, and `class`
- files that use the assignment operator `=` fewer than 5 times
- files whose ratio of number of characters to number of tokens after tokenization is below 1.5
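The five heuristics listed above could be sketched roughly as follows. This is an illustration only, not the dataset authors' implementation: all function names are hypothetical, the per-line keyword measure is an assumed way to quantify "high occurrence" (the description gives no threshold), the `=` count is a rough regular-expression approximation, and `tokenize` stands in for whatever tokenizer was actually used.

```python
import re


def mentions_test_or_config(text, n_lines=5):
    """True if one of the first n_lines mentions a test or configuration file."""
    head = "\n".join(text.splitlines()[:n_lines]).lower()
    return any(phrase in head for phrase in ("test file", "configuration file"))


def keyword_per_line(text, keyword):
    """Occurrences of keyword per line (assumed measure of 'high occurrence')."""
    n_lines = max(len(text.splitlines()), 1)
    return text.lower().count(keyword) / n_lines


def has_no_code_keywords(text):
    """True if none of def, for, while, class appears as a whole word."""
    return re.search(r"\b(def|for|while|class)\b", text) is None


def few_assignments(text, minimum=5):
    """True if '=' appears fewer than `minimum` times.

    Rough count: skips ==, !=, <=, >= but still counts keyword
    arguments and augmented assignments such as +=.
    """
    return len(re.findall(r"(?<![=!<>])=(?!=)", text)) < minimum


def char_token_ratio(text, tokenize):
    """Number of characters divided by number of tokens from `tokenize`."""
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)
```

For example, `char_token_ratio(source, str.split) < 1.5` would flag a file under the last criterion if whitespace splitting were used as a stand-in tokenizer.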
Provider:
codeparrot
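Summary of original information

CodeParrot 🦜 Dataset Cleaned and filtered (train)

Dataset description

This dataset consists of Python files collected from GitHub and is a further-filtered version of codeparrot-clean-train. The filtering aims to identify configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7. The filtering criteria are:

  • Files that mention "test file", "configuration file", or similar in the first 5 lines.
  • Files with a high occurrence of the keywords "test " or "config".
  • Files that contain none of the keywords def, for, while, and class.
  • Files that use the assignment operator = fewer than 5 times.
  • Files whose ratio of character count to token count after tokenization is below 1.5.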
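One way to read "applied with a probability of 0.7" is that a file flagged by one of the first three criteria is dropped 70% of the time, while a file flagged by either of the last two is always dropped. The sketch below illustrates that reading. It is an assumption, not the authors' code, and `should_drop` and its parameters are hypothetical names.

```python
import random


def should_drop(probabilistic_flags, deterministic_flags, rng, p=0.7):
    """Decide whether to drop a file.

    probabilistic_flags: results of the first three criteria (booleans).
    deterministic_flags: results of the last two criteria (booleans).
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    if any(deterministic_flags):
        return True
    if any(probabilistic_flags):
        # Flagged files are dropped only with probability p.
        return rng.random() < p
    return False


rng = random.Random(0)
# Estimate how often a file flagged by a probabilistic criterion is dropped.
trials = 10_000
dropped = sum(should_drop([True], [False], rng) for _ in range(trials))
drop_rate = dropped / trials
```

Passing the random generator explicitly keeps the filtering reproducible across runs, which matters when the same split has to be rebuilt.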