datajuicer/the-pile-philpaper-refined-by-data-juicer

Name: datajuicer/the-pile-philpaper-refined-by-data-juicer
Creator: datajuicer
Published: 2023-10-23 08:33:30
License: Not specified

Source: Hugging Face. Updated 2023-10-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer

Resource description:

This dataset is a refined version of the PhilPaper subset of The Pile, produced with Data-Juicer. Low-quality samples were removed from the original subset to improve overall quality. It is typically used for pre-training large language models (LLMs). It contains 29,117 samples, about 88.82% of the original subset. The refinement recipe chains mapping and filtering operations: cleaning email addresses and links, fixing Unicode, normalizing punctuation and whitespace, and filtering on alphanumeric content, average line length, character repetition, flagged words, language identification score, maximum line length, perplexity, special characters, word count, and word repetition, among others.
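
The mapping operations named in the description (cleaning emails and links, normalizing whitespace) can be illustrated with a minimal sketch. This is not Data-Juicer's implementation: the regular expressions and the function names `clean_email`, `clean_links`, `normalize_whitespace`, and `apply_mappers` are assumptions chosen for illustration, and the patterns are deliberately simple.

```python
import re

# Hypothetical, deliberately simple patterns; production cleaners are stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_RE = re.compile(r"(?:https?|ftp)://\S+")
SPACE_RUN_RE = re.compile(r"[ \t\u00a0\u3000]+")


def clean_email(text: str) -> str:
    """Remove anything that looks like an email address."""
    return EMAIL_RE.sub("", text)


def clean_links(text: str) -> str:
    """Remove http, https, and ftp links."""
    return LINK_RE.sub("", text)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces or tabs and trim each line; newlines are kept."""
    lines = [SPACE_RUN_RE.sub(" ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)


def apply_mappers(text: str) -> str:
    """Run the mappers in a fixed order, each one rewriting the text."""
    for mapper in (clean_email, clean_links, normalize_whitespace):
        text = mapper(text)
    return text


if __name__ == "__main__":
    raw = "Contact  me at a.b@example.org or see https://example.org/paper.pdf  now"
    print(apply_mappers(raw))
```

Mappers rewrite a sample in place, whereas filters decide whether the sample is kept at all, which is why a recipe of this kind usually runs the mappers first.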

Provided by:

datajuicer

Summary of original information

The Pile -- PhilPaper (refined by Data-Juicer)

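Overview

Dataset name: The Pile -- PhilPaper (refined by Data-Juicer)
Dataset version: refined by Data-Juicer
Intended use: typically pre-training large language models
Dataset size: about 1.7 GB (full dataset)
Language: English
Number of samples: 29,117 (about 88.82% of the original dataset retained)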
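
The sample count and the retention rate together imply an approximate size for the unrefined subset. The short calculation below is only an estimate derived from the two figures on this page; because 88.82% is a rounded value, the true original count may differ by a few samples. The variable names are illustrative.

```python
# Figures stated on this page.
kept_samples = 29_117
kept_fraction = 0.8882  # "about 88.82%", so this is a rounded value

# Estimated size of the unrefined subset and number of samples removed.
estimated_original = round(kept_samples / kept_fraction)
estimated_removed = estimated_original - kept_samples

if __name__ == "__main__":
    print(f"estimated original samples: {estimated_original}")
    print(f"estimated removed samples:  {estimated_removed}")
```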

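Dataset processing

Processing tool: Data-Juicer
Processing steps:
- Clean email addresses
- Clean links
- Fix Unicode characters
- Normalize punctuation
- Normalize whitespace
- Alphanumeric filter
- Average line length filter
- Character repetition filter
- Flagged-word filter
- Language identification score filter
- Maximum line length filter
- Perplexity filter
- Special character filter
- Word count filter
- Word repetition filter
- Document SimHash deduplication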
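
To make the filtering steps above more concrete, here is a minimal sketch of three of them: an alphanumeric filter, a character repetition filter, and a word count filter. It is a simplified illustration, not Data-Juicer's code. The statistics are simpler than those a production tool would compute, and the thresholds (`min_alnum`, `max_char_rep`, `min_words`) are hypothetical values, not the ones used in the actual recipe.

```python
from collections import Counter


def alnum_ratio(text: str) -> float:
    """Fraction of characters that are letters or digits."""
    return sum(ch.isalnum() for ch in text) / len(text) if text else 0.0


def char_repetition_ratio(text: str, n: int = 5) -> float:
    """Share of character n-grams that belong to an n-gram seen more than once.

    Simplified statistic for illustration only.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def word_count(text: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(text.split())


def keep_sample(text: str,
                min_alnum: float = 0.7,      # hypothetical threshold
                max_char_rep: float = 0.3,   # hypothetical threshold
                min_words: int = 10) -> bool:  # hypothetical threshold
    """Keep a sample only if it passes every filter."""
    return (alnum_ratio(text) >= min_alnum
            and char_repetition_ratio(text) <= max_char_rep
            and word_count(text) >= min_words)


if __name__ == "__main__":
    good = ("Philosophy of mind asks how mental states relate "
            "to physical states of the brain.")
    noisy = "!!!! ???? #### $$$$ %%%% ^^^^ &&&& **** (((( ))))"
    print(keep_sample(good), keep_sample(noisy))
```

Each filter computes one statistic per sample and compares it with a threshold, so a recipe can be tuned by adjusting thresholds without changing the pipeline structure.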

Processing parameters

Project name: our-recipes-Philpaper
Dataset path: /path/to/the/original/dataset/
Export path: Philpaper-refine-result.jsonl
Number of subprocesses: 50
Dataset cache path: /cache
Tracer enabled: true
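
The export path above ends in `.jsonl`, so the refined output is presumably one JSON object per line. The sketch below shows one way to stream such a file and tally its samples. The field name `text` and the helper names `iter_jsonl` and `summarize` are assumptions; the page does not document the record schema.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path: str, text_key: str = "text") -> Tuple[int, int]:
    """Return (number of samples, total characters in the assumed text field)."""
    n_samples = 0
    n_chars = 0
    for record in iter_jsonl(path):
        n_samples += 1
        n_chars += len(record.get(text_key, ""))
    return n_samples, n_chars


if __name__ == "__main__":
    export_path = "Philpaper-refine-result.jsonl"  # export path listed above
    if Path(export_path).exists():
        print(summarize(export_path))
```

Streaming line by line keeps memory use flat, which matters for a file of roughly 1.7 GB.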
