
danbraunai/pile-uncopyrighted-tok-shuffled

Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danbraunai/pile-uncopyrighted-tok-shuffled
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: input_ids
    list: int32
  splits:
  - name: train
    num_bytes: 1010480246264
    num_examples: 491478719
  - name: val
    num_bytes: 5706738456
    num_examples: 2775651
  - name: test
    num_bytes: 571175304
    num_examples: 277809
  download_size: 416018811941
  dataset_size: 1016758160024
license: mit
---

# Pile Uncopyrighted (Tokenized + Shuffled)

Globally shuffled version of the Pile (uncopyrighted subset), tokenized into fixed-length sequences of 513 token IDs using [`EleutherAI/gpt-neox-20b`](https://huggingface.co/EleutherAI/gpt-neox-20b).

Each row contains a single `input_ids` column (513 int32 values). Documents are concatenated with EOS tokens between them, then reshaped into fixed-length sequences (following the [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens) approach). Sequences are then globally shuffled (seed=42) so that consecutive rows are not from the same document.

| Split | Sequences | Size |
|-------|-----------|------|
| train | 491,478,719 | ~1 TB |
| val | 2,775,651 | ~5.7 GB |
| test | 277,809 | ~571 MB |

## Provenance

This dataset was originally created through a three-stage pipeline across three HuggingFace repos:

1. **Re-split** [`monology/pile-uncopyrighted`](https://huggingface.co/datasets/monology/pile-uncopyrighted) → [`danbraunai/pile-uncopyrighted`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted): Took the single "train" split; last 100K rows → test, preceding 1M → val, rest → train.
2. **Tokenize** [`danbraunai/pile-uncopyrighted`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted) → [`danbraunai/pile-uncopyrighted-tok`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted-tok): Tokenized with `EleutherAI/gpt-neox-20b` into 513-token sequences using `tokenize_and_concatenate` from [`spd/data.py`](https://github.com/ApolloResearch/spd/blob/main/spd/data.py).
3. **Shuffle** [`danbraunai/pile-uncopyrighted-tok`](https://huggingface.co/datasets/danbraunai/pile-uncopyrighted-tok) → **this dataset**: Global shuffle (seed=42) to break document-order correlation between consecutive sequences.

## Original creation scripts

The three scripts that were actually run to create this dataset:

<details>
<summary>Stage 1: Re-split (from danbraunai/pile-uncopyrighted README)</summary>

```python
from datasets import DatasetDict, load_dataset

ds = load_dataset("monology/pile-uncopyrighted", split="train")
n = len(ds)

VAL_SIZE = 1_000_000
TEST_SIZE = 100_000

result = DatasetDict({
    "train": ds.select(range(n - VAL_SIZE - TEST_SIZE)),
    "val": ds.select(range(n - VAL_SIZE - TEST_SIZE, n - TEST_SIZE)),
    "test": ds.select(range(n - TEST_SIZE, n)),
})
result.push_to_hub("danbraunai/pile-uncopyrighted")
```

</details>

<details>
<summary>Stage 2: Tokenize (from danbraunai/pile-uncopyrighted-tok README)</summary>

```python
from datasets import DatasetDict, load_dataset
from transformers import AutoTokenizer

from spd.data import tokenize_and_concatenate

SOURCE_REPO = "danbraunai/pile-uncopyrighted"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
result = DatasetDict()
for split in ["train", "val", "test"]:
    ds = load_dataset(SOURCE_REPO, split=split)
    tokenized = tokenize_and_concatenate(
        ds,
        tokenizer,
        column_name="text",
        max_length=N_CTX,
        add_bos_token=False,
        num_proc=10,
        to_lower=False,
    )
    tokenized = tokenized.with_format(None)
    result[split] = tokenized

result.push_to_hub("danbraunai/pile-uncopyrighted-tok")
```

</details>

<details>
<summary>Stage 3: Shuffle (from scripts/shuffle_and_reupload_dataset.py)</summary>

```python
import time

from datasets import load_dataset

DATASET_NAME = "danbraunai/pile-uncopyrighted-tok"
NEW_DATASET_NAME = "danbraunai/pile-uncopyrighted-tok-shuffled"
SEED = 42
NUM_PROC = 160
SPLITS = ["train", "val", "test"]
SHARD_COUNTS = {"train": 2021, "val": 12, "test": 2}


def process_split(split):
    ds = load_dataset(DATASET_NAME, split=split)
    ds = ds.shuffle(seed=SEED)
    ds = ds.flatten_indices(num_proc=NUM_PROC)
    ds.push_to_hub(NEW_DATASET_NAME, split=split, num_shards=SHARD_COUNTS[split])


for split in SPLITS:
    process_split(split)
```

</details>

## Unified creation script

The script below produces this dataset in a single run directly from `monology/pile-uncopyrighted`, inlining the tokenization logic so it has no dependency on the `spd` package. It lives at [`scripts/create_pile_tok_shuffled.py`](https://github.com/ApolloResearch/spd/blob/main/scripts/create_pile_tok_shuffled.py) in the SPD repo.

Requirements: `pip install datasets transformers numpy`

<details>
<summary>scripts/create_pile_tok_shuffled.py</summary>

```python
"""Create danbraunai/pile-uncopyrighted-tok-shuffled from monology/pile-uncopyrighted.

Unified script combining three stages that were originally run separately:
  1. Re-split: Load single "train" split, carve out val (1M rows) and test (100K rows)
  2. Tokenize: Tokenize with EleutherAI/gpt-neox-20b into 513-token sequences
  3. Shuffle & upload: Global shuffle (seed=42), flatten, push to HuggingFace Hub

Requirements:
    datasets, transformers, numpy, huggingface_hub (with write access to target repo)

Usage:
    python scripts/create_pile_tok_shuffled.py
"""

import time

import numpy as np
from datasets import Dataset, DatasetDict, load_dataset
from transformers import AutoTokenizer

SOURCE_REPO = "monology/pile-uncopyrighted"
TARGET_REPO = "danbraunai/pile-uncopyrighted-tok-shuffled"
TOKENIZER_NAME = "EleutherAI/gpt-neox-20b"
N_CTX = 513
VAL_SIZE = 1_000_000
TEST_SIZE = 100_000
SHUFFLE_SEED = 42
TOKENIZE_NUM_PROC = 10
FLATTEN_NUM_PROC = 160
SHARD_COUNTS = {"train": 2021, "val": 12, "test": 2}


# ---------------------------------------------------------------------------
# Stage 1: Load and re-split
# ---------------------------------------------------------------------------


def load_and_split() -> DatasetDict:
    """Load monology/pile-uncopyrighted and split into train/val/test.

    Split boundaries (from the end of the dataset):
      - Last 100K rows → test
      - Preceding 1M → val
      - Everything else → train
    """
    print("Stage 1: Loading source dataset...", flush=True)
    t0 = time.time()
    ds = load_dataset(SOURCE_REPO, split="train")
    n = len(ds)
    print(f" Loaded {n:,} rows in {time.time() - t0:.1f}s", flush=True)

    assert n > VAL_SIZE + TEST_SIZE, f"Dataset too small: {n}"
    train_end = n - VAL_SIZE - TEST_SIZE
    print(
        f" Splitting: train={train_end:,}, val={VAL_SIZE:,}, test={TEST_SIZE:,}",
        flush=True,
    )
    return DatasetDict(
        {
            "train": ds.select(range(train_end)),
            "val": ds.select(range(train_end, train_end + VAL_SIZE)),
            "test": ds.select(range(train_end + VAL_SIZE, n)),
        }
    )


# ---------------------------------------------------------------------------
# Stage 2: Tokenize
# ---------------------------------------------------------------------------


def tokenize_and_concatenate(
    dataset: Dataset,
    tokenizer: AutoTokenizer,
    max_length: int,
    column_name: str = "text",
    num_proc: int = TOKENIZE_NUM_PROC,
) -> Dataset:
    """Tokenize text and reshape into fixed-length sequences.

    Joins documents with EOS tokens, tokenizes in parallel chunks, then reshapes
    into (num_sequences, max_length). Adapted from TransformerLens.
    """
    for key in dataset.features:
        if key != column_name:
            dataset = dataset.remove_columns(key)

    def tokenize_fn(
        examples: dict[str, list[str]],
    ) -> dict[str, np.ndarray]:
        full_text = tokenizer.eos_token.join(examples[column_name])
        num_chunks = 20
        chunk_length = (len(full_text) - 1) // num_chunks + 1
        chunks = [full_text[i * chunk_length : (i + 1) * chunk_length] for i in range(num_chunks)]
        tokens = np.concatenate(
            [tokenizer.encode(chunk, add_special_tokens=False) for chunk in chunks]
        )
        num_batches = len(tokens) // max_length
        tokens = tokens[: max_length * num_batches].reshape((num_batches, max_length))
        return {"input_ids": tokens}

    return dataset.map(tokenize_fn, batched=True, remove_columns=[column_name], num_proc=num_proc)


# ---------------------------------------------------------------------------
# Stage 3: Shuffle and upload
# ---------------------------------------------------------------------------


def shuffle_and_upload(ds: Dataset, split: str) -> None:
    """Globally shuffle sequences and push to HuggingFace Hub."""
    t0 = time.time()
    print(f" Shuffling (seed={SHUFFLE_SEED})...", flush=True)
    ds = ds.shuffle(seed=SHUFFLE_SEED)
    print(f" Shuffled in {time.time() - t0:.1f}s", flush=True)

    t1 = time.time()
    print(f" Flattening indices (num_proc={FLATTEN_NUM_PROC})...", flush=True)
    ds = ds.flatten_indices(num_proc=FLATTEN_NUM_PROC)
    print(f" Flattened in {time.time() - t1:.1f}s", flush=True)

    t2 = time.time()
    num_shards = SHARD_COUNTS[split]
    print(f" Pushing to {TARGET_REPO} ({num_shards} shards)...", flush=True)
    ds.push_to_hub(TARGET_REPO, split=split, num_shards=num_shards)
    print(f" Pushed in {time.time() - t2:.1f}s", flush=True)


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main():
    total_start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    splits = load_and_split()

    for split_name in ["train", "val", "test"]:
        print(f"\n{'=' * 60}", flush=True)
        print(f"Processing split: {split_name}", flush=True)
        print(f"{'=' * 60}", flush=True)

        t0 = time.time()
        print(f"Stage 2: Tokenizing (max_length={N_CTX})...", flush=True)
        tokenized = tokenize_and_concatenate(splits[split_name], tokenizer, max_length=N_CTX)
        print(
            f" Tokenized {len(tokenized):,} sequences in {time.time() - t0:.1f}s",
            flush=True,
        )

        print("Stage 3: Shuffle and upload", flush=True)
        shuffle_and_upload(tokenized, split_name)

    print(f"\nAll done in {time.time() - total_start:.1f}s", flush=True)


if __name__ == "__main__":
    main()
```

</details>

## Usage

```python
from datasets import load_dataset

# Load a split
ds = load_dataset("danbraunai/pile-uncopyrighted-tok-shuffled", split="train", streaming=True)
for batch in ds:
    input_ids = batch["input_ids"]  # list of 513 int32 token IDs
    break
```
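
The card does not say why rows hold 513 tokens rather than 512. One plausible reading, and it is only an assumption here, is that each row is meant to yield 512 next-token prediction pairs once shifted by one position. The sketch below shows that shift on a stand-in row; `split_inputs_targets` is a hypothetical helper, not part of the dataset or the `spd` package.

```python
def split_inputs_targets(input_ids: list[int]) -> tuple[list[int], list[int]]:
    """Shift a sequence by one position to get (inputs, next-token targets).

    Assumption: the 513th token exists so that 512 (input, target) pairs
    can be formed. The dataset card itself does not state this.
    """
    assert len(input_ids) >= 2, "need at least two tokens to form one pair"
    return input_ids[:-1], input_ids[1:]


# Stand-in for one `input_ids` row; a real row would come from the dataset.
row = list(range(513))
inputs, targets = split_inputs_targets(row)
print(len(inputs), len(targets))
```

With a real row, `inputs[i]` would be the context token at position `i` and `targets[i]` the token that follows it.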
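
The concatenate, reshape, and shuffle steps described in this card can be illustrated on toy data. This is a minimal sketch under stated assumptions, not the script that built the dataset: it works on already-tokenized integer lists instead of joining text with the EOS string, `EOS_ID`, `concat_and_chunk`, and `shuffle_rows` are hypothetical names, and it uses Python's `random` module, so the permutation will not match the one `datasets` produces for seed 42.

```python
import random

EOS_ID = 0  # hypothetical EOS token ID, chosen only for this toy example


def concat_and_chunk(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    """Join tokenized documents with EOS between them, then cut into rows of seq_len."""
    stream: list[int] = []
    for i, doc in enumerate(docs):
        if i > 0:
            stream.append(EOS_ID)  # separator between consecutive documents
        stream.extend(doc)
    num_rows = len(stream) // seq_len  # the trailing partial row is dropped
    return [stream[r * seq_len : (r + 1) * seq_len] for r in range(num_rows)]


def shuffle_rows(rows: list[list[int]], seed: int) -> list[list[int]]:
    """Return a seeded permutation of the rows, leaving the input list untouched."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
rows = concat_and_chunk(docs, seq_len=4)
shuffled = shuffle_rows(rows, seed=42)
print(rows)
```

Two properties carry over to the real pipeline: a row can span a document boundary (the EOS token marks where), and tokens that do not fill a complete row are discarded. In the unified script the truncation happens inside `tokenize_fn`, so it applies once per `map` batch rather than once per split.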
Provider:
danbraunai