datajuicer/redpajama-stack-code-refined-by-data-juicer

Name: datajuicer/redpajama-stack-code-refined-by-data-juicer
Creator: datajuicer
Published: 2023-10-23 08:47:08
License: no description provided

Source: Hugging Face. Updated 2023-10-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer

Resource overview:

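The GitHub code data from RedPajama and TheStack was refined with the Data-Juicer tool, which removed low-quality samples from the original datasets to improve overall quality. The dataset is typically used for pretraining large language models. It contains 49,279,344 samples, about 52.09% of the original dataset. Refinement cleans emails and links, fixes Unicode, normalizes punctuation and whitespace, and cleans copyright notices, then applies several filters to drop samples that do not meet the thresholds. Finally, a document SimHash deduplicator merges the two sources and removes duplicate samples.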
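As a rough illustration of what the cleaning steps do, here is a minimal Python sketch of three of them (email removal, link removal, whitespace normalization). The regular expressions and function names are simplified assumptions made for this example; they are not Data-Juicer's actual patterns or API.

```python
import re
import unicodedata

# Simplified, hypothetical patterns; Data-Juicer's own patterns may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_RE = re.compile(r"(?:https?|ftp)://[^\s\"'<>)]+")


def clean_email(text: str) -> str:
    """Remove anything that looks like an email address."""
    return EMAIL_RE.sub("", text)


def clean_links(text: str) -> str:
    """Remove http(s) and ftp links."""
    return LINK_RE.sub("", text)


def normalize_whitespace(text: str) -> str:
    """Map Unicode space separators (category Zs) to a plain space and trim the ends."""
    mapped = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)
    return mapped.strip()


def clean(text: str) -> str:
    """Apply the cleaning steps in a fixed order, one after another."""
    for step in (clean_email, clean_links, normalize_whitespace):
        text = step(text)
    return text


if __name__ == "__main__":
    raw = "# contact: dev@example.com\nsee https://example.com/x for docs\u00a0"
    print(repr(clean(raw)))
```

Each step is a pure function from text to text, so steps can be reordered or dropped without touching the others.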

Provider:

datajuicer

Summary of original information

RedPajama & TheStack -- Github Code (refined by Data-Juicer)

Dataset overview

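Dataset name: RedPajama & TheStack -- Github Code (refined by Data-Juicer)
Intended use: typically used for pretraining large language models.
Dataset version: refined by Data-Juicer, with some "bad" samples removed to improve data quality.
Dataset size: about 232 GB (full dataset)
Number of samples: 49,279,344 (about 52.09% of the original dataset)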
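The sample count and keep rate above imply an approximate size for the unrefined data. The short calculation below is only an estimate: 52.09% is a rounded figure, and the listing does not say exactly which combined source set counts as "the original dataset".

```python
# Figures taken from the listing above.
kept_samples = 49_279_344
kept_fraction = 0.5209  # "about 52.09%", so the result is approximate

# Estimated number of samples before refinement, and how many were removed.
estimated_original = round(kept_samples / kept_fraction)
estimated_removed = estimated_original - kept_samples

print(f"estimated original samples: {estimated_original:,}")
print(f"estimated removed samples:  {estimated_removed:,}")
```

By this estimate, roughly 94.6 million samples went in and about 45 million were filtered out or deduplicated.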

Refinement process

RedPajama code refinement

Global parameters:
- project_name: Data-Juicer-recipes-code-rp
- dataset_path: /path/to/your/dataset
- export_path: /path/to/your/dataset.jsonl
- np: 50
- open_tracer: true
Processing pipeline:
- clean_email_mapper
- clean_links_mapper
- fix_unicode_mapper
- punctuation_normalization_mapper
- whitespace_normalization_mapper
- clean_copyright_mapper
- alphanumeric_filter:
  - tokenization: False
  - min_ratio: 0.4
  - max_ratio: 0.8
- alphanumeric_filter:
  - tokenization: True
  - min_ratio: 1.5
  - max_ratio: 3
- average_line_length_filter:
  - min_len: 15
  - max_len: 100
- character_repetition_filter:
  - rep_len: 10
  - min_ratio: 0.05
  - max_ratio: 0.3
- maximum_line_length_filter:
  - min_len: 50
  - max_len: 500
- text_length_filter:
  - min_len: 300
- words_num_filter:
  - lang: en
  - tokenization: False
  - min_num: 30
  - max_num: 5000
- word_repetition_filter:
  - lang: en
  - tokenization: False
  - rep_len: 10
  - max_ratio: 0.1
- document_simhash_deduplicator:
  - tokenization: space
  - window_size: 6
  - lowercase: true
  - ignore_pattern: \p{P}
  - num_blocks: 6
  - hamming_distance: 4
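To make the threshold filters more concrete, the sketch below computes four character-level statistics and applies the RedPajama bounds listed above (alphanumeric ratio, average line length, maximum line length, text length). The exact definitions are assumptions chosen for illustration, for example the average line length is taken as the mean of per-line lengths; Data-Juicer's operators may compute these values differently.

```python
from statistics import mean

# Bounds copied from the RedPajama recipe above: (min, max), inclusive.
THRESHOLDS = {
    "alnum_ratio": (0.4, 0.8),
    "avg_line_len": (15, 100),
    "max_line_len": (50, 500),
    "text_len": (300, float("inf")),
}


def compute_stats(text: str) -> dict:
    """Compute per-sample statistics (assumed definitions, see lead-in)."""
    lines = text.splitlines() or [""]
    return {
        "alnum_ratio": sum(ch.isalnum() for ch in text) / max(len(text), 1),
        "avg_line_len": mean(len(line) for line in lines),
        "max_line_len": max(len(line) for line in lines),
        "text_len": len(text),
    }


def keep_sample(text: str) -> bool:
    """Keep a sample only if every statistic falls inside its bounds."""
    stats = compute_stats(text)
    return all(low <= stats[name] <= high for name, (low, high) in THRESHOLDS.items())


if __name__ == "__main__":
    sample = "\n".join(
        f"value_{i} = compute_total(items_{i}, offset_{i})  # step {i}" for i in range(10)
    )
    print(compute_stats(sample))
    print(keep_sample(sample), keep_sample("x = 1"))
```

Separating statistic computation from the threshold check makes it possible to inspect the distribution of each statistic before choosing bounds.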

TheStack code refinement (only max_stars_count >= 20)

Global parameters:
- project_name: Data-Juicer-recipes-the-stack
- dataset_path: /path/to/your/dataset
- export_path: /path/to/your/dataset.jsonl
- text_key: content
- np: 50
- open_tracer: true
Processing pipeline:
- clean_email_mapper
- clean_links_mapper
- fix_unicode_mapper
- punctuation_normalization_mapper
- whitespace_normalization_mapper
- clean_copyright_mapper
- alphanumeric_filter:
  - tokenization: false
  - min_ratio: 0.2
  - max_ratio: 0.9163
- alphanumeric_filter:
  - tokenization: true
  - min_ratio: 0.546
  - max_ratio: 3.65
- average_line_length_filter:
  - min_len: 10
  - max_len: 150
- character_repetition_filter:
  - max_ratio: 0.36
- maximum_line_length_filter:
  - max_len: 1000
- text_length_filter:
  - max_len: 96714
- words_num_filter:
  - min_num: 20
  - max_num: 6640
- word_repetition_filter:
  - rep_len: 10
  - max_ratio: 0.357
- document_simhash_deduplicator:
  - tokenization: space
  - window_size: 6
  - lowercase: true
  - ignore_pattern: \p{P}
  - num_blocks: 6
  - hamming_distance: 4
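The heading of this recipe restricts TheStack to samples with max_stars_count >= 20, and the global parameters read the text from the `content` field. A minimal sketch of that pre-selection over plain dictionaries might look as follows; the helper name `select_starred` is hypothetical, and how the authors actually applied the star filter is not stated in the listing.

```python
from typing import Iterable, Iterator


def select_starred(
    records: Iterable[dict],
    min_stars: int = 20,
    text_key: str = "content",
) -> Iterator[dict]:
    """Yield records with enough stars and a non-empty text field.

    Records whose star count is missing (None) are skipped, which is an
    assumption of this sketch rather than documented behavior.
    """
    for record in records:
        stars = record.get("max_stars_count")
        if stars is not None and stars >= min_stars and record.get(text_key):
            yield record


if __name__ == "__main__":
    records = [
        {"content": "print('a')", "max_stars_count": 120},
        {"content": "print('b')", "max_stars_count": 3},
        {"content": "print('c')", "max_stars_count": None},
        {"content": "", "max_stars_count": 50},
    ]
    for record in select_starred(records):
        print(record["content"])
```

Writing the selection as a generator keeps memory use flat, which matters when the input is streamed from a dataset of this size.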

Merging and deduplicating samples

Global parameters:
- project_name: Data-Juicer-recipes-code
- dataset_path: /path/to/your/dataset
- export_path: /path/to/your/dataset.jsonl
- np: 50
- open_tracer: true
Processing pipeline:
- document_simhash_deduplicator:
  - tokenization: space
  - window_size: 6
  - lowercase: true
  - ignore_pattern: \p{P}
  - num_blocks: 6
  - hamming_distance: 4
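The deduplicator settings above (space tokenization, window size 6, lowercasing, ignored punctuation, Hamming distance 4) can be illustrated with a small SimHash sketch. This is a simplified stand-in, not Data-Juicer's implementation: it uses a 64-bit BLAKE2b hash per shingle, strips characters in Unicode punctuation categories instead of applying a `\p{P}` regex, and compares fingerprints by brute force rather than using `num_blocks` to find candidate pairs.

```python
import hashlib
import unicodedata


def shingles(text: str, window_size: int = 6) -> list:
    """Lowercase, drop punctuation, split on whitespace, build word n-grams."""
    text = "".join(
        ch for ch in text.lower() if not unicodedata.category(ch).startswith("P")
    )
    words = text.split()
    if len(words) < window_size:
        return [" ".join(words)] if words else []
    return [
        " ".join(words[i : i + window_size])
        for i in range(len(words) - window_size + 1)
    ]


def simhash(text: str, bits: int = 64) -> int:
    """Each shingle votes +1 or -1 per bit; positive totals set the bit."""
    counts = [0] * bits
    for shingle in shingles(text):
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts: list, max_distance: int = 4) -> list:
    """Keep a text only if no kept fingerprint is within max_distance bits."""
    kept, fingerprints = [], []
    for text in texts:
        fp = simhash(text)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept


if __name__ == "__main__":
    a = "def add(a, b): return a + b  # adds two numbers together quickly"
    b = "DEF ADD(A, B): RETURN A + B  # ADDS TWO NUMBERS TOGETHER QUICKLY"
    c = "class Queue: stores pending jobs and hands them to idle worker threads"
    print(len(deduplicate([a, b, c])))
```

The brute-force comparison is quadratic in the number of samples; at the scale of this dataset a blocking scheme such as the one controlled by `num_blocks` is what makes near-duplicate search practical.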
