datajuicer/redpajama-cc-2020-05-refined-by-data-juicer

Name: datajuicer/redpajama-cc-2020-05-refined-by-data-juicer
Creator: datajuicer
Published: 2023-10-23 08:55:35
License: No description available

Source: Hugging Face. Updated: 2023-10-23. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer

Description:

This dataset is a version of RedPajama -- CommonCrawl-2020-05 refined with the Data-Juicer tool, which removes low-quality samples to improve overall data quality. It is primarily intended for pre-training large language models (LLMs). A preview subset is available for inspection; the full dataset is approximately 297 GB and contains 42,612,596 samples, roughly 46.90% of the original dataset. The refinement workflow consists of document deduplication, removal of email addresses and links, Unicode repair, punctuation and whitespace normalization, and a series of filters, including alphanumeric-ratio, average-line-length, and character-repetition filters.

Provider:

datajuicer

Summary of original information

RedPajama -- CommonCrawl-2020-05 (refined by Data-Juicer)

Overview

Dataset name: RedPajama -- CommonCrawl-2020-05 (refined by Data-Juicer)
Dataset type: Pre-training dataset
Language: English
Tags: data-juicer, pretraining
Size category: 10M<n<100M
License: Apache-2.0
Task category: Text generation

Dataset information

Number of samples: 42,612,596
Retention ratio: approximately 46.90% (of the original dataset)
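
The two figures above allow a rough back-of-the-envelope estimate of the original dataset size. The sketch below assumes the retention ratio is measured in samples (the card does not say whether it is by samples or by bytes), and the ratio is rounded, so the results are approximate.

```python
# Figures quoted in the dataset card.
kept_samples = 42_612_596
retention_ratio = 0.4690  # "approximately 46.90%"; assumed to be a per-sample ratio

# Rough estimates only: the published ratio is rounded to four digits.
estimated_original = round(kept_samples / retention_ratio)
estimated_removed = estimated_original - kept_samples

print(f"estimated original samples: {estimated_original:,}")
print(f"estimated removed samples:  {estimated_removed:,}")
```

Under that assumption the original dataset held on the order of 90 million samples, so slightly more than half of them were dropped during refinement.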

Refinement recipe

Project name: Data-Juicer-recipes-cc-2020-05
Dataset path: /path/to/your/dataset
Export path: /path/to/your/dataset.jsonl
Number of subprocesses: 50
Tracing enabled: true

Processing pipeline

Document SimHash deduplication:
- Tokenization: space
- Window size: 6
- Lowercase: true
- Ignore pattern: \p{P}
- Number of blocks: 6
- Hamming distance: 4
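
To illustrate how these deduplication parameters fit together, here is a minimal SimHash sketch. It is not Data-Juicer's implementation: the function names, the MD5-based shingle hash, and the 64-bit fingerprint width are assumptions made for illustration. Punctuation is dropped via Unicode categories because Python's built-in `re` module does not support `\p{P}`. The block count is presumably used to index candidate pairs efficiently; this sketch omits it and compares two documents directly.

```python
import hashlib
import unicodedata

WINDOW_SIZE = 6        # shingle length in tokens ("Window size")
HAMMING_DISTANCE = 4   # max differing bits for a near-duplicate
NUM_BITS = 64          # fingerprint width (an assumption, not from the recipe)


def tokenize(text):
    """Lowercase, drop Unicode punctuation (categories P*), split on whitespace."""
    lowered = text.lower()
    cleaned = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return cleaned.split()


def shingles(tokens, size=WINDOW_SIZE):
    """Return overlapping windows of `size` tokens, joined by spaces."""
    if not tokens:
        return []
    if len(tokens) < size:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def simhash(text):
    """Compute a NUM_BITS-bit SimHash fingerprint over token shingles."""
    weights = [0] * NUM_BITS
    for shingle in shingles(tokenize(text)):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(NUM_BITS):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=HAMMING_DISTANCE):
    return hamming(simhash(a), simhash(b)) <= max_distance
```

Two documents that differ only in case or punctuation tokenize identically and therefore receive the same fingerprint; documents with small edits tend to differ in only a few bits, which is what the Hamming-distance threshold is meant to catch.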
Clean emails:
Clean links:
Fix Unicode:
Punctuation normalization:
Whitespace normalization:
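
The five cleaning steps above take no parameters in the recipe. The sketch below shows one plausible way such steps could work, using only the Python standard library. The regular expressions, the punctuation table, and all function names are illustrative assumptions rather than Data-Juicer's actual rules. In particular, `fix_unicode` here only applies NFC normalization; repairing mojibake properly would need a dedicated library such as ftfy.

```python
import re
import unicodedata

# Illustrative patterns; real-world rules are usually more elaborate.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)
LINK_RE = re.compile(r"(?:https?|ftp)://\S+", re.IGNORECASE)

# A small sample of full-width and typographic marks mapped to ASCII.
PUNCT_TABLE = str.maketrans({
    "，": ",", "。": ".", "！": "!", "？": "?",
    "“": '"', "”": '"', "‘": "'", "’": "'",
})


def clean_email(text):
    return EMAIL_RE.sub("", text)


def clean_links(text):
    return LINK_RE.sub("", text)


def fix_unicode(text):
    # Stand-in only: NFC normalization, not full mojibake repair.
    return unicodedata.normalize("NFC", text)


def normalize_punctuation(text):
    return text.translate(PUNCT_TABLE)


def normalize_whitespace(text):
    # Map every Unicode space separator (category Zs) to a plain space.
    spaced = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)
    return spaced.strip()


def clean(text):
    """Apply the five steps in the order listed in the recipe."""
    for step in (clean_email, clean_links, fix_unicode,
                 normalize_punctuation, normalize_whitespace):
        text = step(text)
    return text
```

For example, `clean("Mail john.doe@example.com or see https://example.com/page")` returns a string that contains neither the address nor the URL.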
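Alphanumeric filter:
- Tokenization: false
- Minimum ratio: 0.7469
- Maximum ratio: 0.8609
Average line length filter:
- Maximum length: 1500
Character repetition filter:
- Repetition length: 10
- Maximum ratio: 0.3
Flagged words filter:
- Language: en
- Tokenization: true
- Maximum ratio: 0.002
Language ID score filter:
- Minimum score: 0.774
Maximum line length filter:
- Maximum length: 5000
Perplexity filter:
- Language: en
- Maximum perplexity: 5000
Special characters filter:
- Minimum ratio: 0.15
- Maximum ratio: 0.35
Text length filter:
- Maximum length: 68161
Word count filter:
- Language: en
- Tokenization: true
- Minimum count: 20
- Maximum count: 13644
Word repetition filter:
- Language: en
- Tokenization: true
- Repetition length: 10
- Maximum ratio: 0.328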
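
Several of the filters above reduce to simple statistics compared against the listed thresholds. The following sketch applies five of them (alphanumeric ratio, average line length, maximum line length, text length, and word count) to a single text. The statistic definitions and function names are simplified assumptions, for instance words are counted by whitespace splitting, and they may differ from how Data-Juicer computes them. The flagged-words, language-ID, perplexity, special-character, and repetition filters need lexicons, models, or definitions not given here and are left out.

```python
# Thresholds copied from the recipe above.
THRESHOLDS = {
    "alnum_ratio": (0.7469, 0.8609),
    "avg_line_length_max": 1500,
    "max_line_length_max": 5000,
    "text_length_max": 68161,
    "word_count": (20, 13644),
}


def alnum_ratio(text):
    """Fraction of characters that are letters or digits."""
    return sum(ch.isalnum() for ch in text) / len(text) if text else 0.0


def line_lengths(text):
    """Length of every line; a single 0 for empty text."""
    return [len(line) for line in text.splitlines()] or [0]


def keep_sample(text, thresholds=THRESHOLDS):
    """Return True if the text passes all five simplified filters."""
    lengths = line_lengths(text)
    word_count = len(text.split())  # whitespace split, a simplification
    lo_ratio, hi_ratio = thresholds["alnum_ratio"]
    lo_words, hi_words = thresholds["word_count"]
    return (
        lo_ratio <= alnum_ratio(text) <= hi_ratio
        and sum(lengths) / len(lengths) <= thresholds["avg_line_length_max"]
        and max(lengths) <= thresholds["max_line_length_max"]
        and len(text) <= thresholds["text_length_max"]
        and lo_words <= word_count <= hi_words
    )
```

A sample is kept only if every check passes, so a text with fewer than 20 words or more than 68,161 characters is rejected regardless of its other statistics.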
