danbraunai/pile-uncopyrighted-tok
Source: Hugging Face. Updated 2026-02-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danbraunai/pile-uncopyrighted-tok
Resource overview:
---
license: mit
dataset_info:
  source: danbraunai/pile-uncopyrighted
  tokenizer: EleutherAI/gpt-neox-20b
  sequence_length: 513
---
# Pile Uncopyrighted (Tokenized)
This is [danbraunai/pile-uncopyrighted](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted)
tokenized with [`EleutherAI/gpt-neox-20b`](https://huggingface.co/EleutherAI/gpt-neox-20b).
Each row has a single `input_ids` column holding 513 token IDs.
Source samples are concatenated into one token stream with an EOS token between
consecutive samples (following the approach in
[TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)), and the
stream is then reshaped into fixed-length sequences of 513 tokens.
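
The concatenate-then-reshape step can be illustrated with a minimal sketch. This is a simplified stand-in, not the actual `tokenize_and_concatenate` implementation: the function name `concatenate_and_chunk`, the toy token IDs, and the choice to drop the trailing partial row are all assumptions made for illustration.

```python
from typing import Iterable, List


def concatenate_and_chunk(
    tokenized_docs: Iterable[List[int]],
    eos_token_id: int,
    seq_len: int,
) -> List[List[int]]:
    """Join documents with EOS between them, then cut into rows of seq_len tokens."""
    stream: List[int] = []
    for i, doc in enumerate(tokenized_docs):
        if i > 0:
            stream.append(eos_token_id)  # EOS marks the boundary between documents
        stream.extend(doc)

    # Assumption: a trailing partial row is dropped so every row has equal length.
    n_rows = len(stream) // seq_len
    return [stream[r * seq_len : (r + 1) * seq_len] for r in range(n_rows)]


# Toy example with made-up token IDs and a made-up EOS id of 0.
rows = concatenate_and_chunk([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_token_id=0, seq_len=4)
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Note that a row can start or end in the middle of a document, and a single row can span several documents separated by EOS tokens.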
## Creation script
```python
from datasets import DatasetDict, load_dataset
from transformers import AutoTokenizer

from spd.data import tokenize_and_concatenate

SOURCE_REPO = "danbraunai/pile-uncopyrighted"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513  # tokens per row

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

result = DatasetDict()
for split in ["train", "val", "test"]:
    ds = load_dataset(SOURCE_REPO, split=split)
    # Tokenize the "text" column, join samples with EOS, and cut into N_CTX-token rows.
    tokenized = tokenize_and_concatenate(
        ds,
        tokenizer,
        column_name="text",
        max_length=N_CTX,
        add_bos_token=False,
        num_proc=10,
        to_lower=False,
    )
    # Reset the output format so rows are stored as plain Python lists.
    tokenized = tokenized.with_format(None)
    result[split] = tokenized

# Upload all splits once the loop has finished.
result.push_to_hub("danbraunai/pile-uncopyrighted-tok")
```
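
The card does not say why rows are 513 tokens long. One common convention, assumed here rather than confirmed by the source, is that a row of n+1 tokens yields n input positions and n next-token targets, which would give 512 of each. The sketch below shows that split on a toy row; `split_inputs_and_targets` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

SEQ_LEN = 513  # row length stated in this card


def split_inputs_and_targets(input_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets), where targets are the inputs shifted left by one."""
    if len(input_ids) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} token IDs, got {len(input_ids)}")
    return input_ids[:-1], input_ids[1:]


# Toy row standing in for one `input_ids` entry; real rows hold tokenizer IDs.
row = list(range(SEQ_LEN))
inputs, targets = split_inputs_and_targets(row)
print(len(inputs), len(targets))  # 512 512
```

In practice, the `input_ids` list of a row loaded with `datasets.load_dataset` would take the place of the toy `row`.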
Provider: danbraunai



