datajuicer/redpajama-book-refined-by-data-juicer

Name: datajuicer/redpajama-book-refined-by-data-juicer
Creator: datajuicer
Published: 2023-10-23 08:59:08
License: No description available

Hugging Face | Updated 2023-10-23 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/datajuicer/redpajama-book-refined-by-data-juicer

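Summary:

RedPajama -- Book is a refined version of the Book subset of RedPajama. Data-Juicer was used to remove some low-quality samples from the original dataset in order to improve its quality. The dataset contains 195,983 samples, which is about 95.51% of the original dataset. It is intended mainly for pretraining large language models. The refining process cleans emails and links, fixes Unicode, normalizes punctuation and whitespace, and applies a set of filters to remove samples that do not meet the criteria.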
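
As a rough illustration of the cleaning steps named in the summary, here is a minimal sketch in plain Python. It is not Data-Juicer's implementation: the regular expressions, the punctuation table, and the `clean_text` helper are simplified assumptions, and Unicode NFC normalization stands in for whatever Unicode repair the real operator performs.

```python
import re
import unicodedata

# Simplified, hypothetical patterns; real email and URL matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_RE = re.compile(r"(?:https?|ftp)://\S+")

# A few typographic marks mapped to ASCII equivalents (illustrative, not exhaustive).
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\uff0c": ",", "\u3002": ".",   # full-width comma, ideographic full stop
})

# Non-standard space characters that are replaced by a plain space.
ODD_SPACES_RE = re.compile(r"[\u00a0\u2000-\u200a\u202f\u205f\u3000]")


def clean_text(text: str) -> str:
    """Apply the cleaning steps in the order the summary lists them."""
    text = EMAIL_RE.sub("", text)              # remove email addresses
    text = LINK_RE.sub("", text)               # remove links
    text = unicodedata.normalize("NFC", text)  # stand-in for Unicode fixing
    text = text.translate(PUNCT_MAP)           # normalize punctuation
    text = ODD_SPACES_RE.sub(" ", text)        # normalize whitespace
    return text.strip()


sample = "Mail jane@example.com or visit https://example.com/page for \u201cdetails\u201d.\u00a0"
print(clean_text(sample))
```

The steps are kept as separate substitutions so that each one can be swapped for a more careful implementation without touching the others.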

Provided by:

datajuicer

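Original Information Summary

RedPajama -- Book (refined by Data-Juicer)

Overview

This is a version of the RedPajama Book dataset refined by Data-Juicer. Removing some "bad" samples from the original dataset improves its quality. The dataset is typically used to pretrain large language models.

Dataset Information

Number of samples: 195,983 (about 95.51% of the original dataset is retained)

Refining Recipe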
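
The two figures above imply an approximate size for the original dataset. The short calculation below is only a back-of-the-envelope sketch: because the retained share is given as "about 95.51%", the derived numbers are estimates and not counts reported by the source.

```python
retained_samples = 195_983
retained_fraction = 0.9551  # "about 95.51%", so everything derived here is approximate

# retained = original * fraction, so original = retained / fraction
estimated_original = retained_samples / retained_fraction
estimated_removed = estimated_original - retained_samples

print(f"estimated original size:   ~{estimated_original:,.0f} samples")
print(f"estimated removed samples: ~{estimated_removed:,.0f} samples")
```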

# global parameters
project_name: 'Data-Juicer-recipes-book'
dataset_path: '/path/to/your/dataset'  # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 50  # number of subprocesses used to process the dataset
open_tracer: true

# process schedule
# a list of process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:

  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.55  # <3sigma (0.697)
      max_ratio: 0.854  # 3sigma
  - average_line_length_filter:  # for code
      max_len: 500  # >3sigma (364)
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.2  # >3sigma (0.12)
  - flagged_words_filter:
      lang: en
      tokenization: true
      max_ratio: 0.00047  # 3sigma
  - language_id_score_filter:  # remove language filter
      min_score: 0.2
  - maximum_line_length_filter:  # for code
      max_len: 13381  # 3sigma
  - perplexity_filter:
      lang: en
      max_ppl: 6000  # <3sigma (16516)
  - special_characters_filter:
      max_ratio: 0.5  # >3sigma (0.32)
  - words_num_filter:
      lang: en
      tokenization: true
      min_num: 1000
      max_num: 539754  # 3sigma
  - word_repetition_filter:
      lang: en
      tokenization: true
      rep_len: 10
      max_ratio: 0.194  # 3sigma

  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
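
The `3sigma` comments in the recipe appear to relate each threshold to a three-standard-deviation bound on the corresponding statistic. The sketch below shows that idea for the alphanumeric ratio only. It is an assumption-laden illustration, not Data-Juicer's code: `alnum_ratio`, `three_sigma_bounds`, and `keep_sample` are hypothetical helpers, and the character-level ratio is one plausible reading of `tokenization: false`.

```python
import statistics


def alnum_ratio(text: str) -> float:
    """Fraction of characters that are alphanumeric (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def three_sigma_bounds(values):
    """Return (mean - 3*std, mean + 3*std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


def keep_sample(text: str, min_ratio: float = 0.55, max_ratio: float = 0.854) -> bool:
    """Keep a sample only if its ratio falls inside the recipe's [min_ratio, max_ratio]."""
    return min_ratio <= alnum_ratio(text) <= max_ratio


docs = [
    "A plain English sentence with ordinary words in it.",
    "#### $$$$ !!!! ---- ????",  # no alphanumeric characters at all
    "abc123",                    # ratio 1.0, above max_ratio
]
kept = [d for d in docs if keep_sample(d)]
print(len(kept), "of", len(docs), "samples kept")

# Bounds derived from the statistic's distribution over a (toy) corpus.
low, high = three_sigma_bounds([alnum_ratio(d) for d in docs])
print(f"3-sigma bounds on this toy corpus: [{low:.3f}, {high:.3f}]")
```

In practice the bounds would be computed over the whole dataset and then tightened or loosened by hand, which is consistent with recipe comments such as `<3sigma (0.697)` sitting next to a different chosen value.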
