JonasGeiping/the_pile_WordPiecex32768_8eb2d0ea9da707676c81314c4ea04507

Name: JonasGeiping/the_pile_WordPiecex32768_8eb2d0ea9da707676c81314c4ea04507
Creator: JonasGeiping
Published: 2023-06-13 16:25:35
License: No description available

Hugging Face, updated 2023-06-13, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/JonasGeiping/the_pile_WordPiecex32768_8eb2d0ea9da707676c81314c4ea04507

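Resource summary:

This is a preprocessed and tokenized dataset for the cramming project. The raw data source is The Pile, an 825 GiB diverse, open-source language modeling dataset that combines 22 smaller, high-quality datasets. The dataset contains only a training split and is in English. Its creation involved preprocessing, tokenization, data cleaning, and data ordering. Users should be aware that the additional filtering and ordering are untested and may have unintended consequences. The dataset was filtered, sorted, and preprocessed by Jonas Geiping; the original dataset was curated primarily by Leo Gao and Stella Biderman.

Provided by: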

JonasGeiping

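Summary of original information

Dataset description

Dataset overview

This is a preprocessed, tokenized dataset for the cramming project. This version corresponds to one specific dataset construction setup. The raw data source is the Pile, an 825 GiB diverse, open-source language modeling dataset that combines 22 smaller, high-quality datasets.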

Language

The dataset is in English (EN).

Data splits

This preprocessed subset contains only a training split.
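
Since only a training split exists, loading it might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository is still reachable under the ID shown on this page. The helper name `load_train_split` is hypothetical, and the column names of the returned examples are not documented here, so inspect one example before relying on any field.

```python
# Repository ID as listed on this page.
REPO_ID = "JonasGeiping/the_pile_WordPiecex32768_8eb2d0ea9da707676c81314c4ea04507"


def load_train_split(streaming=True):
    """Return the only available split ("train").

    streaming=True avoids downloading the whole dataset up front.
    """
    # Imported lazily so the module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)


if __name__ == "__main__":
    train = load_train_split()
    # Print one example to discover the actual column names.
    print(next(iter(train)))
```

Streaming is used here only as a cautious default for a large corpus; pass `streaming=False` to materialize the split locally.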

Dataset creation

The dataset was created with the following configuration (YAML):

# This is a slice of the Pile
name: the_pile
defaults:
  - sources:
      - the_pile

# Preprocessing
normalizer:
  force_lowercase: True
  strip_accents: True
  force_english_keyboard: True
  whitespace_escape: False
tokenizer: WordPiece
vocab_size: 32768

# Dataset formation
seq_length: 128
include_cls_token_in_corpus: False
include_sep_token_in_corpus: True
use_type_ids: False
max_entries_in_raw_dataset: 16e6
max_seq_in_tokenized_dataset: 85e6

# Data cleaning
named_entity_simplification: False
remove_whitespaces: False
remove_trash: True
trash_cutoff: 0.25
deduplicate_entries: True
deduplication_threshold: 75

# Data order
ordering: sentence-length-curriculum
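
To make the preprocessing and dataset-formation settings more concrete, here is a minimal sketch of what `force_lowercase`, `strip_accents`, `seq_length`, and the CLS/SEP flags could amount to. It is not the cramming project's code: the function names `normalize` and `pack` are hypothetical, `force_english_keyboard` and `whitespace_escape` are not modeled, and dropping the incomplete final block is an assumption made for this illustration.

```python
import unicodedata

SEQ_LENGTH = 128  # seq_length from the configuration


def normalize(text):
    """Lowercase the text and strip accents (force_lowercase, strip_accents)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # After NFD decomposition, accents are separate combining marks ("Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pack(tokenized_docs, sep_id, seq_length=SEQ_LENGTH):
    """Concatenate tokenized documents and cut them into fixed-length blocks.

    A separator follows every document (include_sep_token_in_corpus: True);
    no CLS token is added (include_cls_token_in_corpus: False).
    """
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(sep_id)
    # Assumption: the incomplete tail block is dropped.
    n_blocks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]


if __name__ == "__main__":
    print(normalize("Café Über"))
    print(pack([[1, 2, 3], [4, 5]], sep_id=0, seq_length=3))
```

Packing documents into one stream before cutting is one common way to get sequences of exactly `seq_length` tokens without padding; whether this dataset was built that way is not stated on this page.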

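Considerations for using the data

This training data was filtered and sorted beyond the usual preprocessing. These modifications have not been tested for unintended consequences.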
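
For readers who want a feel for what this extra filtering and ordering might involve, the sketch below illustrates one plausible reading of `trash_cutoff: 0.25` and `ordering: sentence-length-curriculum`. Both readings are assumptions: the filter follows the idea, described in the cramming paper, of dropping entries that the tokenizer compresses poorly (too many tokens per raw character), and the ordering is assumed to place sequences with fewer non-padding tokens first. The function names and the `pad_id` parameter are hypothetical.

```python
def is_trash(text, tokens, cutoff=0.25):
    """Flag an entry whose token count exceeds cutoff * character count.

    Assumed interpretation of remove_trash / trash_cutoff.
    """
    if not text:
        return True  # treat empty entries as trash in this sketch
    return len(tokens) / len(text) > cutoff


def length_curriculum(sequences, pad_id=0):
    """Sort sequences from fewest to most non-padding tokens (stable sort).

    Assumed interpretation of ordering: sentence-length-curriculum.
    """
    return sorted(sequences, key=lambda seq: sum(1 for t in seq if t != pad_id))


if __name__ == "__main__":
    prose = "the quick brown fox"
    print(is_trash(prose, prose.split()))   # 4 tokens over 19 characters
    markup = "<a><b>"
    print(is_trash(markup, list(markup)))   # one token per character
    print(length_curriculum([[5, 6, 7], [5, 0, 0], [5, 6, 0]]))
```

A model trained on data ordered this way sees short sequences early and long ones late, which is exactly the kind of side effect the note above warns has not been evaluated.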

Additional information

Dataset curators

This dataset is a filtered, sorted, and preprocessed subset of the Pile made by Jonas Geiping. The original dataset was curated primarily by Leo Gao and Stella Biderman, with assistance from the other authors of the Pile paper.

Licensing information

Please refer to the licensing information of the specific subsets: https://huggingface.co/datasets/EleutherAI/pile

Citation information

Filtered version for the cramming project:

@article{geiping_cramming_2022,
  title = {Cramming: {{Training}} a {{Language Model}} on a {{Single GPU}} in {{One Day}}},
  shorttitle = {Cramming},
  author = {Geiping, Jonas and Goldstein, Tom},
  year = {2022},
  month = dec,
  eprint = {2212.14034},
  primaryclass = {cs},
  publisher = {{arXiv}},
  doi = {10.48550/arXiv.2212.14034},
  url = {http://arxiv.org/abs/2212.14034},
  urldate = {2023-01-10},
  archiveprefix = {arxiv},
  keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning},
  journal = {arxiv:2212.14034[cs]}
}

Original data curation:

@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

@article{biderman2022datasheet,
  title={Datasheet for the pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}
