danbraunai/pile-uncopyrighted-tok

Source: Hugging Face. Updated 2026-02-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danbraunai/pile-uncopyrighted-tok
Resource description:
---
license: mit
dataset_info:
  source: danbraunai/pile-uncopyrighted
  tokenizer: EleutherAI/gpt-neox-20b
  sequence_length: 513
---

# Pile Uncopyrighted (Tokenized)

This is [danbraunai/pile-uncopyrighted](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted) tokenized with [`EleutherAI/gpt-neox-20b`](https://huggingface.co/EleutherAI/gpt-neox-20b).

Each row contains a single `input_ids` column with 513 token IDs. Samples are concatenated with EOS tokens between them (following the approach in [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)), then reshaped into fixed-length sequences.

## Creation script

```python
from datasets import DatasetDict, load_dataset
from transformers import AutoTokenizer

from spd.data import tokenize_and_concatenate

SOURCE_REPO = "danbraunai/pile-uncopyrighted"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

result = DatasetDict()
for split in ["train", "val", "test"]:
    ds = load_dataset(SOURCE_REPO, split=split)
    tokenized = tokenize_and_concatenate(
        ds,
        tokenizer,
        column_name="text",
        max_length=N_CTX,
        add_bos_token=False,
        num_proc=10,
        to_lower=False,
    )
    tokenized = tokenized.with_format(None)
    result[split] = tokenized

result.push_to_hub("danbraunai/pile-uncopyrighted-tok")
```
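The concatenate-then-reshape step described above can be illustrated with a small, dependency-free sketch. This is not the `spd.data.tokenize_and_concatenate` implementation; the function name `pack_with_eos`, the toy token IDs, and the choice to drop a trailing partial row are all assumptions made for illustration.

```python
from typing import Iterable, List


def pack_with_eos(
    samples: Iterable[List[int]], eos_id: int, seq_len: int
) -> List[List[int]]:
    """Join tokenized samples with an EOS separator, then cut fixed-length rows.

    Hypothetical helper for illustration only; the real pipeline may differ.
    """
    stream: List[int] = []
    for i, ids in enumerate(samples):
        if i > 0:
            # EOS goes between consecutive samples, not before the first one.
            stream.append(eos_id)
        stream.extend(ids)

    # Assumption: tokens that do not fill a complete row are discarded.
    n_rows = len(stream) // seq_len
    return [stream[r * seq_len : (r + 1) * seq_len] for r in range(n_rows)]


# Toy example: three "documents", EOS id 0, rows of 4 tokens.
rows = pack_with_eos([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, seq_len=4)
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

With `seq_len=513`, every row produced this way has the fixed length the dataset card describes, and a row can start or end in the middle of a source document.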
Provider: danbraunai