datajuicer/redpajama-cc-2022-05-refined-by-data-juicer
Source: Hugging Face. Updated 2023-10-23. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer
Resource overview:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- data-juicer
- pretraining
size_categories:
- 10M<n<100M
---
# RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 dataset from RedPajama, produced with [Data-Juicer](https://github.com/alibaba/data-juicer). Some "bad" samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain a large language model.
**Notice**: This is a small subset for previewing. The whole dataset (about 265GB) is available [here](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl).
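
Because the full file is about 265GB, streaming it line by line is usually more practical than loading it into memory. The following is a minimal sketch, assuming the file is in JSON Lines format with one object per line and that the document body lives in a `text` field (the field name is an assumption, not something this card states). `iter_jsonl` and `preview` are hypothetical helper names.

```python
import json
from itertools import islice
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def preview(path: str, n: int = 3, text_key: str = "text") -> list:
    """Return the text of the first n records without reading the whole file.

    `text_key` is an assumed field name; adjust it to the actual schema.
    """
    return [rec.get(text_key, "") for rec in islice(iter_jsonl(path), n)]
```

For example, `preview("/path/to/redpajama-cc-2022-05-refine-result.jsonl")` would read only the first three lines of a locally downloaded copy (the path is a placeholder).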
## Dataset Information
- Number of samples: 42,648,496 (about 45.34% of the original dataset is kept)
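
The retention figure implies an approximate size for the original dataset, which this card does not state directly. A small worked calculation, assuming the rounded 45.34% ratio is accurate enough for an estimate:

```python
kept_samples = 42_648_496  # number of samples reported in this card
keep_ratio = 0.4534        # ~45.34%, as rounded in this card

# Estimate the original dataset size and the number of removed samples.
estimated_original = round(kept_samples / keep_ratio)
estimated_removed = estimated_original - kept_samples

print(f"estimated original samples: {estimated_original:,}")
print(f"estimated removed samples:  {estimated_removed:,}")
```

Under that assumption the original dataset held roughly 94 million samples, of which a little over half were filtered out.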
## Refining Recipe
```yaml
# global parameters
project_name: 'Data-Juicer-recipes-cc-2022-05'
dataset_path: '/path/to/your/dataset'  # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'
np: 50  # number of subprocesses used to process your dataset
open_tracer: true
# process schedule
# a list of several process operators with their arguments
process:
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:
  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.7514  # 3sigma
      max_ratio: 0.8577  # 3sigma -- 888003
  - average_line_length_filter:  # for code
      max_len: 1500  # < 3sigma -- 447069
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.3  # > 3sigma -- 145890 samples
  - flagged_words_filter:
      lang: en
      tokenization: true
      max_ratio: 0.0012  # 3sigma -- 319395
  - language_id_score_filter:  # remove language filter
      min_score: 0.791  # 3sigma -- 1823528
  - maximum_line_length_filter:  # for code
      max_len: 5000  # < 3sigma -- 791612
  - perplexity_filter:
      lang: en
      max_ppl: 5000  # < 3sigma -- 654459
  - special_characters_filter:
      min_ratio: 0.15  # > 3sigma
      max_ratio: 0.35  # > 3sigma
  - text_length_filter:
      max_len: 59265  # 3sigma -- 1046590
  - words_num_filter:
      lang: en
      tokenization: true
      min_num: 20  # > 3sigma
      max_num: 11860  # 3sigma -- 1036780
  - word_repetition_filter:
      lang: en
      tokenization: true
      rep_len: 10
      max_ratio: 0.3117  # 3sigma -- 2089703
```
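
The recipe comments tag most thresholds with `3sigma`, which suggests cut-offs placed roughly three standard deviations from the mean of each per-sample statistic. The sketch below illustrates that idea for the alphanumeric ratio. It is not Data-Juicer's implementation: `alnum_ratio`, `three_sigma_bounds`, and `keep` are hypothetical helpers, and the use of the population standard deviation is an assumption.

```python
import statistics
from typing import List, Tuple


def alnum_ratio(text: str) -> float:
    """Fraction of characters that are alphanumeric (character level)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def three_sigma_bounds(values: List[float]) -> Tuple[float, float]:
    """Return (mean - 3*std, mean + 3*std) using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


def keep(text: str, min_ratio: float, max_ratio: float) -> bool:
    """Keep a sample only if its ratio lies inside [min_ratio, max_ratio]."""
    return min_ratio <= alnum_ratio(text) <= max_ratio


# Apply the bounds from the recipe's alphanumeric_filter to two toy samples.
samples = ["The quick brown fox jumps over the lazy dog.", "!!!! ???? ----"]
kept = [s for s in samples if keep(s, min_ratio=0.7514, max_ratio=0.8577)]
```

In practice the bounds would be computed once from statistics over the whole dataset and then frozen into the recipe, which is consistent with the fixed values shown above.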
Provider:
datajuicer
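## Summary of Original Information
RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
### Dataset Overview
- Dataset name: RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
- Dataset type: pretraining dataset
- Language: English
- Tags: data-juicer, pretraining
- Size category: 10M<n<100M
- Number of samples: 42,648,496 (about 45.34% of the original dataset is kept)
### Dataset Description
This dataset is a refined version of the CommonCrawl-2022-05 dataset, processed with Data-Juicer to remove some "bad" samples and improve data quality. It is typically used to pretrain large language models.
### Refining Process
- Project name: Data-Juicer-recipes-cc-2022-05
- Dataset path: /path/to/your/dataset
- Export path: /path/to/your/dataset.jsonl
- Number of subprocesses: 50
- Tracer enabled: true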
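### Processing Steps
- Document SimHash deduplication (`document_simhash_deduplicator`):
  - Tokenization: space
  - Window size: 6
  - Lowercase: true
  - Ignore pattern: `\p{P}`
  - Number of blocks: 6
  - Hamming distance: 4
- Clean emails (`clean_email_mapper`)
- Clean links (`clean_links_mapper`)
- Fix Unicode (`fix_unicode_mapper`)
- Punctuation normalization (`punctuation_normalization_mapper`)
- Whitespace normalization (`whitespace_normalization_mapper`)
- Alphanumeric filter (`alphanumeric_filter`):
  - Tokenization: false
  - Minimum ratio: 0.7514
  - Maximum ratio: 0.8577
- Average line length filter (`average_line_length_filter`):
  - Maximum length: 1500
- Character repetition filter (`character_repetition_filter`):
  - Repetition length: 10
  - Maximum ratio: 0.3
- Flagged words filter (`flagged_words_filter`):
  - Language: en
  - Tokenization: true
  - Maximum ratio: 0.0012
- Language ID score filter (`language_id_score_filter`):
  - Minimum score: 0.791
- Maximum line length filter (`maximum_line_length_filter`):
  - Maximum length: 5000
- Perplexity filter (`perplexity_filter`):
  - Language: en
  - Maximum perplexity: 5000
- Special characters filter (`special_characters_filter`):
  - Minimum ratio: 0.15
  - Maximum ratio: 0.35
- Text length filter (`text_length_filter`):
  - Maximum length: 59265
- Word count filter (`words_num_filter`):
  - Language: en
  - Tokenization: true
  - Minimum count: 20
  - Maximum count: 11860
- Word repetition filter (`word_repetition_filter`):
  - Language: en
  - Tokenization: true
  - Repetition length: 10
  - Maximum ratio: 0.3117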
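
The SimHash deduplication step listed first treats two documents as near-duplicates when their fingerprints differ in only a few bits (a Hamming distance of 4 in this recipe). The sketch below is a simplified illustration of that idea, not Data-Juicer's operator. It assumes 64-bit fingerprints, MD5 as the shingle hash, and the regex `[^\w\s]` as a rough stand-in for the `\p{P}` punctuation pattern (Python's built-in `re` does not support `\p{P}`). The `num_blocks` parameter, which in SimHash schemes typically partitions the fingerprint to speed up candidate lookup, is omitted. All function names are hypothetical.

```python
import hashlib
import re
from typing import List


def shingles(text: str, window_size: int = 6) -> List[str]:
    """Lowercase, drop punctuation, split on whitespace, form word n-grams."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + window_size])
            for i in range(len(words) - window_size + 1)]


def simhash(text: str, bits: int = 64) -> int:
    """Build a fingerprint by a per-bit majority vote over shingle hashes."""
    counts = [0] * bits
    for sh in shingles(text):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 4) -> bool:
    """True if the two fingerprints are within max_distance bits."""
    return hamming(simhash(a), simhash(b)) <= max_distance
```

Because case and punctuation are normalized before hashing, `is_near_duplicate("Hello, World!", "hello world")` returns `True`: both inputs reduce to the same shingle and therefore to the same fingerprint.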



