danbraunai/pile-uncopyrighted-tok

Name: danbraunai/pile-uncopyrighted-tok
Creator: danbraunai
Published: 2026-02-08 23:59:44
License: MIT

Source: Hugging Face. Updated 2026-02-08. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/danbraunai/pile-uncopyrighted-tok


Description:

---
license: mit
dataset_info:
  source: danbraunai/pile-uncopyrighted
  tokenizer: EleutherAI/gpt-neox-20b
  sequence_length: 513
---

# Pile Uncopyrighted (Tokenized)

This is [danbraunai/pile-uncopyrighted](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted) tokenized with [`EleutherAI/gpt-neox-20b`](https://huggingface.co/EleutherAI/gpt-neox-20b).

Each row contains a single `input_ids` column with 513 token IDs. Samples are concatenated with EOS tokens between them (following the approach in [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)), then reshaped into fixed-length sequences.

## Creation script

```python
from datasets import DatasetDict, load_dataset
from transformers import AutoTokenizer

from spd.data import tokenize_and_concatenate

SOURCE_REPO = "danbraunai/pile-uncopyrighted"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

result = DatasetDict()
for split in ["train", "val", "test"]:
    ds = load_dataset(SOURCE_REPO, split=split)
    tokenized = tokenize_and_concatenate(
        ds,
        tokenizer,
        column_name="text",
        max_length=N_CTX,
        add_bos_token=False,
        num_proc=10,
        to_lower=False,
    )
    tokenized = tokenized.with_format(None)
    result[split] = tokenized

result.push_to_hub("danbraunai/pile-uncopyrighted-tok")
```
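The concatenate-then-reshape step described above can be illustrated with a toy example. This is a minimal sketch in plain Python, not the actual `spd.data.tokenize_and_concatenate` implementation: the helper name `concat_and_chunk`, the toy token IDs, the EOS ID of `0`, and the choice to drop a trailing partial row are all assumptions made for illustration.

```python
def concat_and_chunk(token_lists, eos_id, seq_len):
    """Join tokenized documents with EOS separators, then cut into fixed-length rows.

    Hypothetical helper for illustration only; the real pipeline may differ
    in details such as how leftover tokens are handled.
    """
    flat = []
    for i, toks in enumerate(token_lists):
        if i > 0:
            flat.append(eos_id)  # EOS marks the boundary between documents
        flat.extend(toks)

    # Assumption: tokens that do not fill a complete row are discarded.
    n_rows = len(flat) // seq_len
    return [flat[r * seq_len:(r + 1) * seq_len] for r in range(n_rows)]


# Three toy "documents" that are already tokenized (IDs are made up).
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
rows = concat_and_chunk(docs, eos_id=0, seq_len=4)

for row in rows:
    print(row)
```

With `seq_len=4`, the 11 tokens in the joined stream yield two full rows and three leftover tokens that are dropped. Note that a row can begin or end in the middle of a document, so a single row may span several source samples.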

Provider:

danbraunai
