---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- data-juicer
- pretraining
size_categories:
- 1M<n<10M
---
# The Pile -- USPTO (refined by Data-Juicer)
A refined version of the USPTO dataset in The Pile, produced with [Data-Juicer](https://github.com/alibaba/data-juicer). Some "bad" samples were removed from the original dataset to raise its quality.
This dataset is typically used to pretrain large language models.
**Notice**: This is a small subset for previewing. The whole dataset is available [here](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) (about 18 GB).
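Because the full file is a large `.jsonl` download, it is usually easier to stream it line by line than to load it all at once. The following is a minimal sketch, assuming each line is a standalone JSON object and that the sample text lives under a `text` key. Neither the key name nor the helper `iter_samples` comes from the dataset card, so check a few records before relying on them.

```python
import json
from typing import Iterator


def iter_samples(path: str, text_key: str = "text") -> Iterator[str]:
    """Yield the text of each record in a JSON Lines file, one line at a time.

    `text_key` is an assumption: adjust it to the field the file actually uses.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            yield record[text_key]
```

For example, `next(iter_samples("the-pile-uspto-refine-result.jsonl"))` would return the first sample without reading the rest of the file.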
## Dataset Information
- Number of samples: 4,516,283 (~46.77% of the original dataset is kept)
## Refining Recipe
```yaml
# global parameters
project_name: 'Data-Juicer-recipes-uspto'
dataset_path: '/path/to/your/dataset'  # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'  # path to your dataset result file

np: 50  # number of subprocesses to process your dataset
open_tracer: true

# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:

  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.7  # <3sigma (0.758)
  - average_line_length_filter:  # for code
      max_len: 2000  # >3sigma (1307)
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.2  # >3sigma (0.189)
  - flagged_words_filter:
      lang: en
      tokenization: true
      max_ratio: 0.0016  # 3sigma
  - language_id_score_filter:
      min_score: 0.6
  - maximum_line_length_filter:  # for code
      max_len: 3061  # 3sigma
  - perplexity_filter:
      lang: en
      max_ppl: 4000  # 3sigma
  - special_characters_filter:
      max_ratio: 0.3  # > 3sigma (0.274)
  - text_length_filter:
      max_len: 21556  # 3sigma
  - words_num_filter:
      lang: en
      tokenization: true
      min_num: 100
      max_num: 6000  # 3sigma
  - word_repetition_filter:
      lang: en
      tokenization: true
      rep_len: 10
      max_ratio: 0.169  # 3sigma

  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
```
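The `3sigma` comments in the recipe suggest that most filter thresholds were chosen from the distribution of each per-sample statistic, at or near three standard deviations from the mean. Where a comment reads `<3sigma` or `>3sigma`, the value in parentheses appears to be the 3-sigma point and the configured threshold is set on the stated side of it. The card does not say how these values were computed, so the following is only a sketch of one plausible way to derive such bounds, assuming a plain mean and population standard deviation over the collected statistics. The helper name `three_sigma_bounds` is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def three_sigma_bounds(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - 3*std, mean + 3*std) for a sequence of per-sample statistics.

    Uses the population standard deviation; this choice is an assumption,
    not something stated in the recipe.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std
```

Under this reading, a `max_*` threshold would be taken from the upper bound and a `min_*` threshold from the lower bound, for example `lower, upper = three_sigma_bounds(word_counts)` for a list of per-document word counts.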