danbraunai/pile-uncopyrighted-tok-shuffled
Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danbraunai/pile-uncopyrighted-tok-shuffled
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: input_ids
    list: int32
  splits:
  - name: train
    num_bytes: 1010480246264
    num_examples: 491478719
  - name: val
    num_bytes: 5706738456
    num_examples: 2775651
  - name: test
    num_bytes: 571175304
    num_examples: 277809
  download_size: 416018811941
  dataset_size: 1016758160024
license: mit
---
# Pile Uncopyrighted (Tokenized + Shuffled)
Globally shuffled version of the Pile (uncopyrighted subset), tokenized into
fixed-length sequences of 513 token IDs using
[`EleutherAI/gpt-neox-20b`](https://huggingface.co/EleutherAI/gpt-neox-20b).
Each row contains a single `input_ids` column (513 int32 values). Documents are
concatenated with EOS tokens between them, then reshaped into fixed-length
sequences (following the
[TransformerLens](https://github.com/TransformerLensOrg/TransformerLens) approach).
Sequences are then globally shuffled (seed=42) so that consecutive rows are
unlikely to come from the same document.
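
The concatenate-and-reshape step can be illustrated on toy data. The sketch below is a simplification that works directly on token IDs; `pack_sequences`, the EOS ID of 0, and the sequence length of 4 are hypothetical names and values chosen for illustration (the real pipeline joins document text with the tokenizer's EOS string before tokenizing, and uses length 513).

```python
def pack_sequences(docs, eos_id, seq_len):
    """Join token lists with an EOS ID between them, then cut into fixed-length rows."""
    stream = []
    for i, doc in enumerate(docs):
        if i > 0:
            stream.append(eos_id)  # EOS goes between documents, not after the last one
        stream.extend(doc)
    num_rows = len(stream) // seq_len  # the trailing partial row is dropped
    return [stream[r * seq_len : (r + 1) * seq_len] for r in range(num_rows)]


# Hypothetical toy documents; 0 stands in for the EOS token ID.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
rows = pack_sequences(toy_docs, eos_id=0, seq_len=4)
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

A row can therefore start or end mid-document and may span several documents, and tokens that do not fill a complete row are discarded.
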
| Split | Sequences | Size |
|-------|-----------|------|
| train | 491,478,719 | ~1 TB |
| val | 2,775,651 | ~5.7 GB |
| test | 277,809 | ~571 MB |
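
The sizes in the table can be cross-checked against the `num_bytes` and `num_examples` values in the dataset metadata. A short sketch of that arithmetic follows; reading the extra 4 bytes per row as list-offset overhead is an assumption, not something the dataset card states.

```python
SEQ_LEN = 513      # token IDs per row
BYTES_PER_ID = 4   # int32

# (num_examples, num_bytes) copied from the dataset metadata
SPLITS = {
    "train": (491_478_719, 1_010_480_246_264),
    "val": (2_775_651, 5_706_738_456),
    "test": (277_809, 571_175_304),
}

summary = {}
for name, (num_examples, num_bytes) in SPLITS.items():
    bytes_per_row, remainder = divmod(num_bytes, num_examples)
    summary[name] = {
        "tokens": num_examples * SEQ_LEN,
        "bytes_per_row": bytes_per_row,
        "remainder": remainder,
        # Assumed to be per-row list offset overhead; not confirmed by the source
        "overhead_per_row": bytes_per_row - SEQ_LEN * BYTES_PER_ID,
    }
    print(name, summary[name])
```

Every split works out to exactly 2,056 bytes per row: 513 × 4 = 2,052 bytes of token IDs plus 4 further bytes. The train split holds roughly 252 billion tokens. The table reports `num_bytes`; the `download_size` in the metadata is smaller, about 416 GB.
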
## Provenance
This dataset was originally created through a three-stage pipeline across three
Hugging Face repos:
1. **Re-split** [`monology/pile-uncopyrighted`](https://huggingface.co/datasets/monology/pile-uncopyrighted)
→ [`danbraunai/pile-uncopyrighted`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted):
Took the single "train" split; last 100K rows → test, preceding 1M → val,
rest → train.
2. **Tokenize** [`danbraunai/pile-uncopyrighted`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted)
→ [`danbraunai/pile-uncopyrighted-tok`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted-tok):
Tokenized with `EleutherAI/gpt-neox-20b` into 513-token sequences using
`tokenize_and_concatenate` from
[`spd/data.py`](https://github.com/ApolloResearch/spd/blob/main/spd/data.py).
3. **Shuffle** [`danbraunai/pile-uncopyrighted-tok`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted-tok)
→ **this dataset**: Global shuffle (seed=42) to break document-order
correlation between consecutive sequences.
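
To illustrate why the final shuffle matters, the sketch below builds a toy list in which each sequence is tagged with the document it came from, then measures how often neighbouring rows share a document before and after a seeded shuffle. It uses Python's `random` module purely for illustration; the document IDs and sizes are hypothetical, and the resulting permutation is not the one produced by `Dataset.shuffle(seed=42)`.

```python
import random


def adjacent_same_doc_fraction(doc_ids):
    """Fraction of neighbouring pairs that come from the same document."""
    pairs = list(zip(doc_ids, doc_ids[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical toy corpus: 100 documents, each contributing 10 consecutive sequences.
ordered = [doc for doc in range(100) for _ in range(10)]

shuffled = ordered.copy()
random.Random(42).shuffle(shuffled)  # seeded, so the result is reproducible

before = adjacent_same_doc_fraction(ordered)
after = adjacent_same_doc_fraction(shuffled)
print(f"same-document neighbours: {before:.3f} before, {after:.3f} after")
```

In the unshuffled toy list about 90% of neighbouring rows share a document; after shuffling, that fraction falls to roughly what chance would give.
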
## Original creation scripts
The three scripts that were actually run to create this dataset:
<details>
<summary>Stage 1: Re-split (from danbraunai/pile-uncopyrighted README)</summary>

```python
from datasets import DatasetDict, load_dataset

ds = load_dataset("monology/pile-uncopyrighted", split="train")
n = len(ds)
VAL_SIZE = 1_000_000
TEST_SIZE = 100_000

result = DatasetDict({
    "train": ds.select(range(n - VAL_SIZE - TEST_SIZE)),
    "val": ds.select(range(n - VAL_SIZE - TEST_SIZE, n - TEST_SIZE)),
    "test": ds.select(range(n - TEST_SIZE, n)),
})
result.push_to_hub("danbraunai/pile-uncopyrighted")
```

</details>
<details>
<summary>Stage 2: Tokenize (from danbraunai/pile-uncopyrighted-tok README)</summary>

```python
from datasets import DatasetDict, load_dataset
from transformers import AutoTokenizer

from spd.data import tokenize_and_concatenate

SOURCE_REPO = "danbraunai/pile-uncopyrighted"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
result = DatasetDict()
for split in ["train", "val", "test"]:
    ds = load_dataset(SOURCE_REPO, split=split)
    tokenized = tokenize_and_concatenate(
        ds,
        tokenizer,
        column_name="text",
        max_length=N_CTX,
        add_bos_token=False,
        num_proc=10,
        to_lower=False,
    )
    tokenized = tokenized.with_format(None)
    result[split] = tokenized

result.push_to_hub("danbraunai/pile-uncopyrighted-tok")
```

</details>
<details>
<summary>Stage 3: Shuffle (from scripts/shuffle_and_reupload_dataset.py)</summary>

```python
import time

from datasets import load_dataset

DATASET_NAME = "danbraunai/pile-uncopyrighted-tok"
NEW_DATASET_NAME = "danbraunai/pile-uncopyrighted-tok-shuffled"
SEED = 42
NUM_PROC = 160
SPLITS = ["train", "val", "test"]
SHARD_COUNTS = {"train": 2021, "val": 12, "test": 2}


def process_split(split):
    ds = load_dataset(DATASET_NAME, split=split)
    ds = ds.shuffle(seed=SEED)
    ds = ds.flatten_indices(num_proc=NUM_PROC)
    ds.push_to_hub(NEW_DATASET_NAME, split=split, num_shards=SHARD_COUNTS[split])


for split in SPLITS:
    process_split(split)
```

</details>
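
The `SHARD_COUNTS` values are hard-coded in the Stage 3 script, and the source does not say how they were chosen. As an observation only, they coincide with dividing each split's `num_bytes` (from the dataset metadata) by 500 MB and adding one, which the sketch below checks. The 500 MB target and the helper name `shards_for` are assumptions made for this illustration.

```python
TARGET_SHARD_BYTES = 500 * 10**6  # assumed target of 500 MB per shard (not stated in the source)

# num_bytes per split, copied from the dataset metadata
NUM_BYTES = {"train": 1_010_480_246_264, "val": 5_706_738_456, "test": 571_175_304}
# Shard counts used by the Stage 3 script
SHARD_COUNTS = {"train": 2021, "val": 12, "test": 2}


def shards_for(num_bytes, target=TARGET_SHARD_BYTES):
    """Hypothetical helper: shard count that keeps each shard under `target` bytes."""
    return num_bytes // target + 1


for split, num_bytes in NUM_BYTES.items():
    print(split, shards_for(num_bytes), SHARD_COUNTS[split])
```
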
## Unified creation script
The script below produces this dataset in a single run directly from
`monology/pile-uncopyrighted`, inlining the tokenization logic so it has no
dependency on the `spd` package. It lives at
[`scripts/create_pile_tok_shuffled.py`](https://github.com/ApolloResearch/spd/blob/main/scripts/create_pile_tok_shuffled.py)
in the SPD repo.
Requirements: `pip install datasets transformers numpy`
<details>
<summary>scripts/create_pile_tok_shuffled.py</summary>

```python
"""Create danbraunai/pile-uncopyrighted-tok-shuffled from monology/pile-uncopyrighted.

Unified script combining three stages that were originally run separately:
1. Re-split: Load single "train" split, carve out val (1M rows) and test (100K rows)
2. Tokenize: Tokenize with EleutherAI/gpt-neox-20b into 513-token sequences
3. Shuffle & upload: Global shuffle (seed=42), flatten, push to HuggingFace Hub

Requirements: datasets, transformers, numpy, huggingface_hub (with write access to target repo)

Usage: python scripts/create_pile_tok_shuffled.py
"""

import time

import numpy as np
from datasets import Dataset, DatasetDict, load_dataset
from transformers import AutoTokenizer

SOURCE_REPO = "monology/pile-uncopyrighted"
TARGET_REPO = "danbraunai/pile-uncopyrighted-tok-shuffled"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513
VAL_SIZE = 1_000_000
TEST_SIZE = 100_000
SHUFFLE_SEED = 42
TOKENIZE_NUM_PROC = 10
FLATTEN_NUM_PROC = 160
SHARD_COUNTS = {"train": 2021, "val": 12, "test": 2}

# ---------------------------------------------------------------------------
# Stage 1: Load and re-split
# ---------------------------------------------------------------------------


def load_and_split() -> DatasetDict:
    """Load monology/pile-uncopyrighted and split into train/val/test.

    Split boundaries (from the end of the dataset):
    - Last 100K rows → test
    - Preceding 1M → val
    - Everything else → train
    """
    print("Stage 1: Loading source dataset...", flush=True)
    t0 = time.time()
    ds = load_dataset(SOURCE_REPO, split="train")
    n = len(ds)
    print(f" Loaded {n:,} rows in {time.time() - t0:.1f}s", flush=True)

    assert n > VAL_SIZE + TEST_SIZE, f"Dataset too small: {n}"
    train_end = n - VAL_SIZE - TEST_SIZE
    print(
        f" Splitting: train={train_end:,}, val={VAL_SIZE:,}, test={TEST_SIZE:,}",
        flush=True,
    )
    return DatasetDict(
        {
            "train": ds.select(range(train_end)),
            "val": ds.select(range(train_end, train_end + VAL_SIZE)),
            "test": ds.select(range(train_end + VAL_SIZE, n)),
        }
    )

# ---------------------------------------------------------------------------
# Stage 2: Tokenize
# ---------------------------------------------------------------------------


def tokenize_and_concatenate(
    dataset: Dataset,
    tokenizer: AutoTokenizer,
    max_length: int,
    column_name: str = "text",
    num_proc: int = TOKENIZE_NUM_PROC,
) -> Dataset:
    """Tokenize text and reshape into fixed-length sequences.

    Joins documents with EOS tokens, tokenizes in parallel chunks, then reshapes
    into (num_sequences, max_length). Adapted from TransformerLens.
    """
    for key in dataset.features:
        if key != column_name:
            dataset = dataset.remove_columns(key)

    def tokenize_fn(
        examples: dict[str, list[str]],
    ) -> dict[str, np.ndarray]:
        full_text = tokenizer.eos_token.join(examples[column_name])
        num_chunks = 20
        chunk_length = (len(full_text) - 1) // num_chunks + 1
        chunks = [full_text[i * chunk_length : (i + 1) * chunk_length] for i in range(num_chunks)]
        tokens = np.concatenate(
            [tokenizer.encode(chunk, add_special_tokens=False) for chunk in chunks]
        )
        num_batches = len(tokens) // max_length
        tokens = tokens[: max_length * num_batches].reshape((num_batches, max_length))
        return {"input_ids": tokens}

    return dataset.map(tokenize_fn, batched=True, remove_columns=[column_name], num_proc=num_proc)

# ---------------------------------------------------------------------------
# Stage 3: Shuffle and upload
# ---------------------------------------------------------------------------


def shuffle_and_upload(ds: Dataset, split: str) -> None:
    """Globally shuffle sequences and push to HuggingFace Hub."""
    t0 = time.time()
    print(f" Shuffling (seed={SHUFFLE_SEED})...", flush=True)
    ds = ds.shuffle(seed=SHUFFLE_SEED)
    print(f" Shuffled in {time.time() - t0:.1f}s", flush=True)

    t1 = time.time()
    print(f" Flattening indices (num_proc={FLATTEN_NUM_PROC})...", flush=True)
    ds = ds.flatten_indices(num_proc=FLATTEN_NUM_PROC)
    print(f" Flattened in {time.time() - t1:.1f}s", flush=True)

    t2 = time.time()
    num_shards = SHARD_COUNTS[split]
    print(f" Pushing to {TARGET_REPO} ({num_shards} shards)...", flush=True)
    ds.push_to_hub(TARGET_REPO, split=split, num_shards=num_shards)
    print(f" Pushed in {time.time() - t2:.1f}s", flush=True)

# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main():
    total_start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    splits = load_and_split()

    for split_name in ["train", "val", "test"]:
        print(f"\n{'=' * 60}", flush=True)
        print(f"Processing split: {split_name}", flush=True)
        print(f"{'=' * 60}", flush=True)

        t0 = time.time()
        print(f"Stage 2: Tokenizing (max_length={N_CTX})...", flush=True)
        tokenized = tokenize_and_concatenate(splits[split_name], tokenizer, max_length=N_CTX)
        print(
            f" Tokenized {len(tokenized):,} sequences in {time.time() - t0:.1f}s",
            flush=True,
        )

        print("Stage 3: Shuffle and upload", flush=True)
        shuffle_and_upload(tokenized, split_name)

    print(f"\nAll done in {time.time() - total_start:.1f}s", flush=True)


if __name__ == "__main__":
    main()
```
</details>
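
Inside `tokenize_fn`, each batch of joined text is cut into 20 character chunks before encoding, with the chunk length computed by ceiling division. The sketch below reproduces just that slicing on a short hypothetical string (no tokenizer involved). It shows that the chunks always rejoin to the original text and that trailing chunks can be empty when the text is short. `split_into_chunks` is a name introduced here for illustration.

```python
def split_into_chunks(full_text, num_chunks=20):
    """Slice text into `num_chunks` pieces using the same arithmetic as tokenize_fn."""
    chunk_length = (len(full_text) - 1) // num_chunks + 1  # ceiling of len / num_chunks
    return [full_text[i * chunk_length : (i + 1) * chunk_length] for i in range(num_chunks)]


# Hypothetical 45-character input; real batches are far longer.
text = "abcde" * 9
chunks = split_into_chunks(text)

print(len(chunks))                  # 20: the list always has num_chunks entries
print(sum(1 for c in chunks if c))  # 15: only the first 15 chunks are non-empty here
print("".join(chunks) == text)      # True: slicing loses no characters
```

Because the boundaries are chosen by character count, a chunk may begin or end in the middle of a word, so the token IDs around a boundary may differ from what encoding the whole string at once would give.
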
## Usage
```python
from datasets import load_dataset

# Stream the train split so the full ~1 TB does not have to be downloaded first
ds = load_dataset("danbraunai/pile-uncopyrighted-tok-shuffled", split="train", streaming=True)
for example in ds:
    input_ids = example["input_ids"]  # list of 513 token IDs (stored as int32)
    break
```
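
The dataset card does not say why rows hold 513 tokens rather than 512. One common convention, assumed here and not confirmed by the source, is to store one extra token so that a row yields 512 inputs and 512 next-token targets. The sketch below shows that split on a stand-in row; `to_inputs_and_targets` is a hypothetical helper.

```python
def to_inputs_and_targets(input_ids):
    """Split one row into model inputs and next-token targets (shifted by one)."""
    return input_ids[:-1], input_ids[1:]


# Stand-in for one row of the dataset: 513 placeholder token IDs.
row = list(range(513))
inputs, targets = to_inputs_and_targets(row)
print(len(inputs), len(targets))  # 512 512
```
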
Provider:
danbraunai



