datajuicer/redpajama-cc-2021-04-refined-by-data-juicer
Source: Hugging Face | Updated: 2023-10-23 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- data-juicer
- pretraining
size_categories:
- 10M<n<100M
---
# RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 dataset in RedPajama, produced with [Data-Juicer](https://github.com/alibaba/data-juicer). Some "bad" samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
**Notice**: This repository contains only a small subset for previewing. The whole dataset (about 284GB) is available [here](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl).
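Because the full dataset is a single large JSON Lines file, it is usually more practical to stream it line by line than to load it at once. The following is a minimal sketch of that idea. The helper names `iter_jsonl` and `preview` are hypothetical, and the `text` field in the sample records is an assumption about the record layout, not something this card states.

```python
import json
from itertools import islice


def iter_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def preview(lines, n=3):
    """Return the first n records without reading the rest of the stream."""
    return list(islice(iter_jsonl(lines), n))


# Hypothetical sample lines; the "text" key is an assumption.
sample_lines = [
    '{"text": "first document"}\n',
    "\n",
    '{"text": "second document"}\n',
]
print(preview(sample_lines, n=1))
```

With a local copy of the file, the same helper can be applied to an open file handle, for example `with open("redpajama-cc-2021-04-refine-result.jsonl", encoding="utf-8") as f: preview(f, n=3)`.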
## Dataset Information
- Number of samples: 44,724,752 (~45.23% of the original dataset is kept)
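The two figures above imply an approximate size for the original dataset. As a rough back-of-the-envelope check (the card does not state the original count, and the percentage is rounded), dividing the kept sample count by the kept fraction gives roughly 98.9 million original samples, so about 54.2 million were removed:

```python
kept_samples = 44_724_752  # from the dataset card
kept_fraction = 0.4523     # "~45.23%", rounded on the card

# Estimates only: the rounding of the percentage limits their precision.
estimated_original = round(kept_samples / kept_fraction)
estimated_removed = estimated_original - kept_samples

print(f"estimated original samples: {estimated_original:,}")
print(f"estimated removed samples:  {estimated_removed:,}")
```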
## Refining Recipe
```yaml
# global parameters
project_name: 'Data-Juicer-recipes-cc-2021-04'
dataset_path: '/path/to/your/dataset'  # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 50  # number of subprocess to process your dataset
open_tracer: true

# process schedule
# a list of several process operators with their arguments
process:
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4

  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:

  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.7494  # 3sigma
      max_ratio: 0.8595  # 3sigma -- 1001790
  - average_line_length_filter:  # for code
      max_len: 1500  # < 3sigma (2817) -- 541131
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.3  # > 3sigma (0.1463) -- 159152
  - flagged_words_filter:
      lang: en
      tokenization: true
      max_ratio: 0.0019  # 3sigma -- 184714
  - language_id_score_filter:  # remove language filter
      min_score: 0.786  # 3sigma -- 1995115
  - maximum_line_length_filter:  # for code
      max_len: 5000  # < 3sigma -- 1076085
  - perplexity_filter:
      lang: en
      max_ppl: 5000  # < 3sigma -- 906649
  - special_characters_filter:
      min_ratio: 0.15  # > 3sigma
      max_ratio: 0.35  # > 3sigma -- 1046590
  - text_length_filter:
      max_len: 61592  # 3sigma -- 1114727
  - words_num_filter:
      lang: en
      tokenization: true
      min_num: 20  # > 3sigma
      max_num: 12241  # 3sigma -- 1120334
  - word_repetition_filter:
      lang: en
      tokenization: true
      rep_len: 10
      max_ratio: 0.3105  # 3sigma -- 2234933
```
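Many thresholds in the recipe are annotated with `3sigma`, which appears to refer to the three-sigma rule: bounds placed three standard deviations from the mean of a per-sample statistic. The card does not show how these values were computed, so the following is only a sketch of that rule under this assumption. The function names and the sample ratios are hypothetical.

```python
from statistics import fmean, pstdev


def three_sigma_bounds(values):
    """Return (mean - 3*std, mean + 3*std) for a list of per-sample statistics."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    return mu - 3 * sigma, mu + 3 * sigma


def keep(value, lower, upper):
    """Keep a sample only if its statistic lies within the bounds."""
    return lower <= value <= upper


# Hypothetical alphanumeric ratios for a handful of samples.
ratios = [0.78, 0.80, 0.82, 0.81, 0.79]
lower, upper = three_sigma_bounds(ratios)
print(f"min_ratio ~ {lower:.4f}, max_ratio ~ {upper:.4f}")
```

Comments such as `< 3sigma` or `> 3sigma` in the recipe suggest that some thresholds were deliberately set tighter or looser than the computed bound.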
Provider:
datajuicer
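Summary of original information
RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
Overview
- Dataset name: RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
- Dataset type: pretraining dataset
- Language: English
- Tags: data-juicer, pretraining
- Dataset size: 10M<n<100M
- License: Apache-2.0
- Task category: text generation
Dataset information
- Number of samples: 44,724,752
- Retention ratio: about 45.23% (kept from the original dataset)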
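Dataset processing pipeline
- Global parameters:
  - Project name: Data-Juicer-recipes-cc-2021-04
  - Dataset path: /path/to/your/dataset
  - Export path: /path/to/your/dataset.jsonl
  - Number of subprocesses: 50
  - Tracer enabled: true
- Processing steps:
  - Document deduplication:
    - Tokenization: space
    - Window size: 6
    - Lowercase: true
    - Ignore pattern: \p{P}
    - Number of blocks: 6
    - Hamming distance: 4
  - Clean emails:
  - Clean links:
  - Fix Unicode:
  - Punctuation normalization:
  - Whitespace normalization:
  - Alphanumeric filter:
    - Tokenization: false
    - Minimum ratio: 0.7494
    - Maximum ratio: 0.8595
  - Average line length filter:
    - Maximum length: 1500
  - Character repetition filter:
    - Repetition length: 10
    - Maximum ratio: 0.3
  - Flagged words filter:
    - Language: en
    - Tokenization: true
    - Maximum ratio: 0.0019
  - Language ID score filter:
    - Minimum score: 0.786
  - Maximum line length filter:
    - Maximum length: 5000
  - Perplexity filter:
    - Language: en
    - Maximum perplexity: 5000
  - Special characters filter:
    - Minimum ratio: 0.15
    - Maximum ratio: 0.35
  - Text length filter:
    - Maximum length: 61592
  - Word count filter:
    - Language: en
    - Tokenization: true
    - Minimum count: 20
    - Maximum count: 12241
  - Word repetition filter:
    - Language: en
    - Tokenization: true
    - Repetition length: 10
    - Maximum ratio: 0.3105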
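The document deduplication step corresponds to the `document_simhash_deduplicator` operator in the recipe, configured with space tokenization, lowercasing, a window size of 6, and a Hamming distance of 4. As a rough illustration of the general SimHash idea, and not Data-Juicer's actual implementation, the sketch below fingerprints a text from its 6-token shingles and treats two texts as near-duplicates when their 64-bit fingerprints differ in at most 4 bits. The use of MD5, the 64-bit width, and all function names are assumptions; the `ignore_pattern` and `num_blocks` parameters are not modeled.

```python
import hashlib


def simhash(text, window_size=6, bits=64):
    """Compute a SimHash fingerprint from lowercase, space-tokenized shingles."""
    tokens = text.lower().split()
    if len(tokens) < window_size:
        shingles = [" ".join(tokens)]
    else:
        shingles = [
            " ".join(tokens[i:i + window_size])
            for i in range(len(tokens) - window_size + 1)
        ]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=4):
    """Treat two texts as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

A pairwise comparison like this does not scale to tens of millions of documents; production deduplicators typically index fingerprints so that only likely candidates are compared.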



