EleutherAI/pythia-memorized-evals

Name: EleutherAI/pythia-memorized-evals
Creator: EleutherAI
Published: 2026-02-21 22:33:49
License: MIT

Hugging Face | Updated 2026-02-21 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/EleutherAI/pythia-memorized-evals

Description:

---
license: mit
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: tokens
    sequence: int64
  - name: __index_level_0__
    dtype: int64
tags:
- memorization
- pythia
- the-pile
- language-modeling
---

# Pythia Memorized Evals

This dataset contains the results of memorization evaluations for all [Pythia](https://huggingface.co/collections/EleutherAI/pythia-scaling-suite-64fb5dfa8c21ebb3db7ad2e1) models. For each model, the dataset lists every training sequence that the fully trained model has memorized.

A training sequence is considered **memorized** if, when prompted with the first 32 tokens of the sequence, the model's greedy continuation exactly matches the next 32 tokens. This is evaluated over all ~146M training sequences in the Pile.

This dataset was generated for the paper [Emergent and Predictable Memorization in Large Language Models](https://arxiv.org/abs/2304.11158) (NeurIPS 2023). That paper introduced the study of memorization at the level of individual sequences (prior work had treated memorization as a corpus-level statistical phenomenon) and posed the problem of proactively predicting which specific sequences a model will memorize before or during training.

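The memorization criterion described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: `is_memorized`, `make_hf_greedy_fn`, and `counting_model` are hypothetical names introduced here, and the adapter assumes a Hugging Face causal language model that exposes `generate` and `device`.

```python
from typing import Callable, Sequence

PROMPT_LEN = 32        # tokens given to the model as the prompt
CONTINUATION_LEN = 32  # tokens the model must reproduce exactly

# A callable that maps (prompt tokens, number of new tokens) to generated tokens.
GreedyFn = Callable[[Sequence[int], int], Sequence[int]]


def is_memorized(tokens: Sequence[int], greedy_continue: GreedyFn) -> bool:
    """True if the greedy continuation of the first 32 tokens
    exactly matches the next 32 tokens of the training sequence."""
    if len(tokens) < PROMPT_LEN + CONTINUATION_LEN:
        raise ValueError("need at least 64 tokens")
    prompt = list(tokens[:PROMPT_LEN])
    truth = list(tokens[PROMPT_LEN:PROMPT_LEN + CONTINUATION_LEN])
    generated = list(greedy_continue(prompt, CONTINUATION_LEN))
    return generated == truth


def make_hf_greedy_fn(model) -> GreedyFn:
    """Hypothetical adapter around a Hugging Face causal LM (assumption)."""
    import torch  # imported lazily so the pure-Python part runs without torch

    def greedy_continue(prompt, n_new):
        input_ids = torch.tensor([list(prompt)], device=model.device)
        with torch.no_grad():
            # do_sample=False selects the argmax token at each position.
            out = model.generate(input_ids, max_new_tokens=n_new, do_sample=False)
        return out[0, len(prompt):].tolist()

    return greedy_continue


# Toy stand-in for a model: it always continues by counting upward.
def counting_model(prompt, n_new):
    start = prompt[-1] + 1
    return list(range(start, start + n_new))


print(is_memorized(list(range(64)), counting_model))          # True
print(is_memorized(list(range(63)) + [999], counting_model))  # False
```

With a real model, `make_hf_greedy_fn(model)` would take the place of `counting_model`. If generation stops early (for example on an end-of-sequence token), the shorter output simply fails the exact-match comparison.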
## Dataset Structure

Each row represents a single memorized training sequence and has three columns:

| Column | Type | Description |
|--------|------|-------------|
| `index` | int | Index of the training sequence (0-indexed into the Pile training data, ~146M sequences total) |
| `tokens` | list[int] | The first 64 tokens of the training sequence (32-token prompt + 32-token continuation) |
| `__index_level_0__` | int | Same as `index` (artifact of the original pandas export) |

### Splits

There are 16 primary splits, one per model, covering all 8 Pythia model sizes in both the standard ("duped") and deduplicated ("deduped") variants:

| Split | Memorized Sequences |
|-------|--------------------:|
| `duped.70m` | 463,953 |
| `duped.160m` | 689,673 |
| `duped.410m` | 970,341 |
| `duped.1b` | 1,256,141 |
| `duped.1.4b` | 1,373,722 |
| `duped.2.8b` | 1,675,077 |
| `duped.6.9b` | 2,120,969 |
| `duped.12b` | 2,382,326 |
| `deduped.70m` | 411,448 |
| `deduped.160m` | 581,195 |
| `deduped.410m` | 811,039 |
| `deduped.1b` | 1,032,865 |
| `deduped.1.4b` | 1,048,097 |
| `deduped.2.8b` | 1,355,211 |
| `deduped.6.9b` | 1,680,294 |
| `deduped.12b` | 1,871,215 |

There are also 13 additional splits: 12 for intermediate training checkpoints of the 12B models, plus `deduped.1b.new`, whose sequence count matches that of `deduped.1b`:

| Split | Memorized Sequences |
|-------|--------------------:|
| `duped.12b.23000` | 198,175 |
| `duped.12b.43000` | 442,253 |
| `duped.12b.63000` | 724,678 |
| `duped.12b.83000` | 1,068,501 |
| `duped.12b.103000` | 1,510,459 |
| `duped.12b.123000` | 1,996,011 |
| `deduped.12b.23000` | 163,418 |
| `deduped.12b.43000` | 358,863 |
| `deduped.12b.63000` | 585,067 |
| `deduped.12b.83000` | 852,068 |
| `deduped.12b.103000` | 1,195,578 |
| `deduped.12b.123000` | 1,564,055 |
| `deduped.1b.new` | 1,032,865 |

Each intermediate checkpoint split records the sequences memorized at that point in training (e.g., `duped.12b.83000` is the set of sequences memorized by `pythia-12b` at step 83,000 out of 143,000).
The primary splits without a step number use the final checkpoint (step 143,000).

## Usage

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load memorized sequences for a specific model
ds = load_dataset("EleutherAI/pythia-memorized-evals", split="duped.1.4b")

# Get the set of memorized training sequence indices
memorized_indices = set(ds["index"])
print(f"{len(memorized_indices):,} sequences memorized")

# Decode the tokens to see what was memorized
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
for i in range(5):
    text = tokenizer.decode(ds[i]["tokens"])
    print(f"Sequence {ds[i]['index']}: {text[:100]}...")
```

## Memorization Definition

A sequence is memorized if the model can **exactly reproduce** 32 tokens given the preceding 32 tokens as a prompt, using greedy decoding (argmax at each position). Specifically:

1. Prompt the model with the first 32 tokens of the training sequence
2. Greedily generate the next 32 tokens
3. The sequence is memorized if all 32 generated tokens exactly match the ground truth continuation

This evaluation is run over all ~146M training sequences in the Pile.

## Key Findings

- Larger models memorize more: Pythia-12B memorizes roughly 5x as many sequences as Pythia-70M.
- Training on deduplicated data reduces memorization across all model sizes.
- Memorization increases throughout training, as shown by the intermediate 12B checkpoint splits.
- Despite these large counts, even the largest model memorizes fewer than 2% of training sequences.

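The findings above can be checked against the split counts listed on this card. The sketch below only does arithmetic on those published counts. `TOTAL_SEQUENCES` is an assumption, since the card gives the total only as "~146M", and all variable names are illustrative.

```python
# Counts copied from the split tables on this card.
TOTAL_SEQUENCES = 146_000_000  # assumption: the card only says "~146M"

SIZES = ["70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b"]
DUPED = dict(zip(SIZES, [463_953, 689_673, 970_341, 1_256_141,
                         1_373_722, 1_675_077, 2_120_969, 2_382_326]))
DEDUPED = dict(zip(SIZES, [411_448, 581_195, 811_039, 1_032_865,
                           1_048_097, 1_355_211, 1_680_294, 1_871_215]))

# duped.12b counts by training step; step 143,000 is the final checkpoint.
DUPED_12B_BY_STEP = {
    23_000: 198_175, 43_000: 442_253, 63_000: 724_678, 83_000: 1_068_501,
    103_000: 1_510_459, 123_000: 1_996_011, 143_000: DUPED["12b"],
}

# Finding 1: the 12B model memorizes roughly 5x as many sequences as 70M.
ratio_duped = DUPED["12b"] / DUPED["70m"]
ratio_deduped = DEDUPED["12b"] / DEDUPED["70m"]

# Finding 2: deduplicated training yields fewer memorized sequences at every size.
dedup_always_lower = all(DEDUPED[s] < DUPED[s] for s in SIZES)

# Finding 3: memorization grows monotonically across the 12B checkpoints.
steps = sorted(DUPED_12B_BY_STEP)
grows_with_training = all(
    DUPED_12B_BY_STEP[a] < DUPED_12B_BY_STEP[b] for a, b in zip(steps, steps[1:])
)

# Finding 4: even the largest count is under 2% of all training sequences.
largest_fraction = max(DUPED.values()) / TOTAL_SEQUENCES

print(f"12b / 70m (duped):   {ratio_duped:.2f}x")    # 5.13x
print(f"12b / 70m (deduped): {ratio_deduped:.2f}x")  # 4.55x
print(f"dedup lower at every size: {dedup_always_lower}")
print(f"monotonic over checkpoints: {grows_with_training}")
print(f"largest memorized fraction: {largest_fraction:.2%}")  # 1.63%
```

The "roughly 5x" figure corresponds most closely to the standard ("duped") models. For the deduplicated models the same ratio is closer to 4.5x.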
## Citation

```bibtex
@article{biderman2023emergent,
  title={Emergent and Predictable Memorization in Large Language Models},
  author={Biderman, Stella and Prashanth, USVSN Sai and Sutawika, Lintang and Schoelkopf, Hailey and Anthony, Quentin and Purohit, Shivanshu and Raff, Edward},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={28072--28090},
  year={2023}
}
```

The Pythia models used in this work are described in:

```bibtex
@inproceedings{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Stella Biderman and Hailey Schoelkopf and Quentin Gregory Anthony and Herbie Bradley and Kyle O'Brien and Eric Hallahan and Mohammad Aflah Khan and Shivanshu Purohit and USVSN Sai Prashanth and Edward Raff and Aviya Skowron and Lintang Sutawika and Oskar van der Wal},
  booktitle={Proceedings of the 40th International Conference on Machine Learning},
  year={2023},
  url={https://arxiv.org/abs/2304.01373}
}
```

## Related Resources

- [Emergent and Predictable Memorization in Large Language Models](https://arxiv.org/abs/2304.11158) (NeurIPS 2023)
- [Pythia model collection](https://huggingface.co/collections/EleutherAI/pythia-scaling-suite-64fb5dfa8c21ebb3db7ad2e1)
- [Pythia paper](https://arxiv.org/abs/2304.01373) (ICML 2023)
- [the Pile training data (preshuffled, standard)](https://huggingface.co/datasets/EleutherAI/pile-standard-pythia-preshuffled)
- [the Pile training data (preshuffled, deduplicated)](https://huggingface.co/datasets/EleutherAI/pile-deduped-pythia-preshuffled)

Provider:

EleutherAI

Original information summary