
EleutherAI/pythia-memorized-evals

Source: Hugging Face. Updated 2026-02-21; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/EleutherAI/pythia-memorized-evals
Description:
---
license: mit
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: tokens
    sequence: int64
  - name: __index_level_0__
    dtype: int64
tags:
- memorization
- pythia
- the-pile
- language-modeling
---

# Pythia Memorized Evals

This dataset contains the results of memorization evaluations for all [Pythia](https://huggingface.co/collections/EleutherAI/pythia-scaling-suite-64fb5dfa8c21ebb3db7ad2e1) models. For each model, the dataset lists every training sequence that the fully trained model has memorized.

A training sequence is considered **memorized** if, when prompted with the first 32 tokens of the sequence, the model's greedy continuation exactly matches the next 32 tokens. This is evaluated over all ~146M training sequences in the Pile.

This dataset was generated for the paper [Emergent and Predictable Memorization in Large Language Models](https://arxiv.org/abs/2304.11158) (NeurIPS 2023). That paper introduced the study of memorization at the level of individual sequences (prior work had treated memorization as a corpus-level statistical phenomenon) and posed the problem of proactively predicting which specific sequences a model will memorize before or during training.
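The memorization criterion described above can be sketched as a small check. This is only an illustration under stated assumptions: `is_memorized`, `greedy_generate`, and `counting_model` are hypothetical names that do not come from the paper or from any library, and the toy "model" simply counts upward so that the example runs without loading a real Pythia checkpoint.

```python
from typing import Callable, List, Sequence

PROMPT_LEN = 32        # tokens given to the model as the prompt
CONTINUATION_LEN = 32  # tokens the model must reproduce exactly


def is_memorized(
    tokens: Sequence[int],
    greedy_generate: Callable[[List[int], int], List[int]],
) -> bool:
    """Return True if the greedy continuation of the first 32 tokens
    exactly matches the next 32 tokens of the sequence."""
    if len(tokens) < PROMPT_LEN + CONTINUATION_LEN:
        raise ValueError("need at least 64 tokens")
    prompt = list(tokens[:PROMPT_LEN])
    target = list(tokens[PROMPT_LEN:PROMPT_LEN + CONTINUATION_LEN])
    # greedy_generate is a hypothetical callable: (prompt, n_new) -> n_new token ids
    generated = list(greedy_generate(prompt, CONTINUATION_LEN))
    return generated == target


def counting_model(prompt: List[int], n_new: int) -> List[int]:
    """Toy stand-in for a language model: continues by counting up."""
    return [prompt[-1] + 1 + i for i in range(n_new)]


sequence = list(range(64))         # 0, 1, ..., 63
perturbed = sequence[:63] + [999]  # last continuation token differs

print(is_memorized(sequence, counting_model))
print(is_memorized(perturbed, counting_model))
```

With a real model, `greedy_generate` would wrap a deterministic (argmax) decoding call. A single mismatched token among the 32 generated ones is enough for the sequence not to count as memorized.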

## Dataset Structure

Each row represents a single memorized training sequence and has three columns:

| Column | Type | Description |
|--------|------|-------------|
| `index` | int | Index of the training sequence (0-indexed into the Pile training data, ~146M sequences total) |
| `tokens` | list[int] | The first 64 tokens of the training sequence (32-token prompt + 32-token continuation) |
| `__index_level_0__` | int | Same as `index` (artifact of the original pandas export) |

### Splits

There are 16 primary splits, one per model, covering all 8 Pythia model sizes in both the standard ("duped") and deduplicated ("deduped") variants:

| Split | Memorized Sequences |
|-------|--------------------:|
| `duped.70m` | 463,953 |
| `duped.160m` | 689,673 |
| `duped.410m` | 970,341 |
| `duped.1b` | 1,256,141 |
| `duped.1.4b` | 1,373,722 |
| `duped.2.8b` | 1,675,077 |
| `duped.6.9b` | 2,120,969 |
| `duped.12b` | 2,382,326 |
| `deduped.70m` | 411,448 |
| `deduped.160m` | 581,195 |
| `deduped.410m` | 811,039 |
| `deduped.1b` | 1,032,865 |
| `deduped.1.4b` | 1,048,097 |
| `deduped.2.8b` | 1,355,211 |
| `deduped.6.9b` | 1,680,294 |
| `deduped.12b` | 1,871,215 |

There are also 13 additional splits: 12 for intermediate training checkpoints of the 12B models, plus `deduped.1b.new`, whose sequence count matches that of `deduped.1b`:

| Split | Memorized Sequences |
|-------|--------------------:|
| `duped.12b.23000` | 198,175 |
| `duped.12b.43000` | 442,253 |
| `duped.12b.63000` | 724,678 |
| `duped.12b.83000` | 1,068,501 |
| `duped.12b.103000` | 1,510,459 |
| `duped.12b.123000` | 1,996,011 |
| `deduped.12b.23000` | 163,418 |
| `deduped.12b.43000` | 358,863 |
| `deduped.12b.63000` | 585,067 |
| `deduped.12b.83000` | 852,068 |
| `deduped.12b.103000` | 1,195,578 |
| `deduped.12b.123000` | 1,564,055 |
| `deduped.1b.new` | 1,032,865 |

The intermediate checkpoint splits record which sequences are memorized at that point in training (e.g., `duped.12b.83000` is the set of sequences memorized by `pythia-12b` at step 83,000 out of 143,000).
The primary splits without a step number use the final checkpoint (step 143,000).

## Usage

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load memorized sequences for a specific model
ds = load_dataset("EleutherAI/pythia-memorized-evals", split="duped.1.4b")

# Get the set of memorized training sequence indices
memorized_indices = set(ds["index"])
print(f"{len(memorized_indices):,} sequences memorized")

# Decode the tokens to see what was memorized
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
for i in range(5):
    text = tokenizer.decode(ds[i]["tokens"])
    print(f"Sequence {ds[i]['index']}: {text[:100]}...")
```

## Memorization Definition

A sequence is memorized if the model can **exactly reproduce** 32 tokens given the preceding 32 tokens as a prompt, using greedy decoding (argmax at each position). Specifically:

1. Prompt the model with the first 32 tokens of the training sequence.
2. Greedily generate the next 32 tokens.
3. The sequence is memorized if all 32 generated tokens exactly match the ground truth continuation.

This evaluation is run over all ~146M training sequences in the Pile.

## Key Findings

- Larger models memorize more: Pythia-12B memorizes roughly 5x as many sequences as Pythia-70M.
- Training on deduplicated data reduces memorization across all model sizes.
- Memorization increases throughout training, as shown by the intermediate 12B checkpoint splits.
- Despite these large counts, even the largest model memorizes fewer than 2% of training sequences.
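
The findings above can be checked against the split counts with a few lines of arithmetic. The sketch below hard-codes the counts from the split tables in this card and assumes a round figure of 146,000,000 training sequences (the card only says "~146M"), so the percentage is approximate. The variable names are illustrative.

```python
# Memorized-sequence counts copied from the split tables (final checkpoints).
DUPED = {
    "70m": 463_953, "160m": 689_673, "410m": 970_341, "1b": 1_256_141,
    "1.4b": 1_373_722, "2.8b": 1_675_077, "6.9b": 2_120_969, "12b": 2_382_326,
}
DEDUPED = {
    "70m": 411_448, "160m": 581_195, "410m": 811_039, "1b": 1_032_865,
    "1.4b": 1_048_097, "2.8b": 1_355_211, "6.9b": 1_680_294, "12b": 1_871_215,
}
# Counts for the intermediate duped 12B checkpoints, keyed by training step;
# the final entry (step 143,000) is the primary `duped.12b` split.
DUPED_12B_BY_STEP = {
    23_000: 198_175, 43_000: 442_253, 63_000: 724_678, 83_000: 1_068_501,
    103_000: 1_510_459, 123_000: 1_996_011, 143_000: 2_382_326,
}
TOTAL_SEQUENCES = 146_000_000  # assumption: rounded from "~146M"

# Finding 1: the largest model vs. the smallest model.
ratio_duped = DUPED["12b"] / DUPED["70m"]
ratio_deduped = DEDUPED["12b"] / DEDUPED["70m"]

# Finding 2: deduplicated training data yields fewer memorized sequences at every size.
dedup_always_lower = all(DEDUPED[size] < DUPED[size] for size in DUPED)

# Finding 3: counts grow monotonically with training step.
counts_in_step_order = [DUPED_12B_BY_STEP[s] for s in sorted(DUPED_12B_BY_STEP)]
grows_with_training = all(
    a < b for a, b in zip(counts_in_step_order, counts_in_step_order[1:])
)

# Finding 4: share of the training set memorized by the largest model.
largest_fraction = DUPED["12b"] / TOTAL_SEQUENCES

print(f"12B / 70M (duped):   {ratio_duped:.2f}x")
print(f"12B / 70M (deduped): {ratio_deduped:.2f}x")
print(f"deduped < duped at every size: {dedup_always_lower}")
print(f"memorization grows with training: {grows_with_training}")
print(f"largest model memorizes {largest_fraction:.2%} of sequences")
```

The check compares set sizes only. It says nothing about whether the same sequences are memorized across models or checkpoints, which would require intersecting the `index` columns of the corresponding splits.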

## Citation

```bibtex
@article{biderman2023emergent,
  title={Emergent and Predictable Memorization in Large Language Models},
  author={Biderman, Stella and Prashanth, USVSN Sai and Sutawika, Lintang and Schoelkopf, Hailey and Anthony, Quentin and Purohit, Shivanshu and Raff, Edward},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={28072--28090},
  year={2023}
}
```

The Pythia models used in this work are described in:

```bibtex
@inproceedings{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Stella Biderman and Hailey Schoelkopf and Quentin Gregory Anthony and Herbie Bradley and Kyle O'Brien and Eric Hallahan and Mohammad Aflah Khan and Shivanshu Purohit and USVSN Sai Prashanth and Edward Raff and Aviya Skowron and Lintang Sutawika and Oskar van der Wal},
  booktitle={Proceedings of the 40th International Conference on Machine Learning},
  year={2023},
  url={https://arxiv.org/abs/2304.01373}
}
```

## Related Resources

- [Emergent and Predictable Memorization in Large Language Models](https://arxiv.org/abs/2304.11158) (NeurIPS 2023)
- [Pythia model collection](https://huggingface.co/collections/EleutherAI/pythia-scaling-suite-64fb5dfa8c21ebb3db7ad2e1)
- [Pythia paper](https://arxiv.org/abs/2304.01373) (ICML 2023)
- [The Pile training data (preshuffled, standard)](https://huggingface.co/datasets/EleutherAI/pile-standard-pythia-preshuffled)
- [The Pile training data (preshuffled, deduplicated)](https://huggingface.co/datasets/EleutherAI/pile-deduped-pythia-preshuffled)

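Provider:
EleutherAI
Summary of original information

Dataset overview

Dataset features

  • index: data type int64.
  • tokens: sequence of int64.
  • __index_level_0__: data type int64.

Dataset splits

The dataset contains multiple splits, each with its own name, size in bytes, and number of examples. The splits are detailed below: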

  • duped.1.4b: 730820104 bytes, 1373722 examples.
  • deduped.1.4b: 557587604 bytes, 1048097 examples.
  • duped.160m: 366906036 bytes, 689673 examples.
  • deduped.160m: 309195740 bytes, 581195 examples.
  • duped.12b: 1267397432 bytes, 2382326 examples.
  • deduped.12b: 995486380 bytes, 1871215 examples.
  • duped.70m: 246822996 bytes, 463953 examples.
  • deduped.70m: 218890336 bytes, 411448 examples.
  • duped.2.8b: 891140964 bytes, 1675077 examples.
  • deduped.2.8b: 720972252 bytes, 1355211 examples.
  • duped.410m: 516221412 bytes, 970341 examples.
  • deduped.410m: 431472748 bytes, 811039 examples.
  • duped.6.9b: 1128355508 bytes, 2120969 examples.
  • deduped.6.9b: 893916408 bytes, 1680294 examples.
  • duped.1b: 668267012 bytes, 1256141 examples.
  • deduped.1b: 549484180 bytes, 1032865 examples.
  • duped.12b.23000: 105429100 bytes, 198175 examples.
  • duped.12b.43000: 235278596 bytes, 442253 examples.
  • duped.12b.63000: 385528696 bytes, 724678 examples.
  • duped.12b.83000: 568442532 bytes, 1068501 examples.
  • duped.12b.103000: 803564188 bytes, 1510459 examples.
  • duped.12b.123000: 1061877852 bytes, 1996011 examples.
  • deduped.12b.23000: 86938376 bytes, 163418 examples.
  • deduped.12b.43000: 190915116 bytes, 358863 examples.
  • deduped.12b.63000: 311255644 bytes, 585067 examples.
  • deduped.12b.83000: 453300176 bytes, 852068 examples.
  • deduped.12b.103000: 636047496 bytes, 1195578 examples.
  • deduped.12b.123000: 832077260 bytes, 1564055 examples.
  • deduped.1b.new: 549484180 bytes, 1032865 examples.
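
The byte and example counts above imply a fixed record size, which gives a quick sanity check on the schema. The sketch below divides bytes by examples for a handful of the listed splits. The interpretation of the result (64 int64 tokens plus two int64 index columns accounting for 528 of the bytes, with the remainder being per-row storage overhead) is an assumption, not something the source states.

```python
# (bytes, examples) pairs copied from the split list above.
SPLITS = {
    "duped.70m": (246_822_996, 463_953),
    "deduped.70m": (218_890_336, 411_448),
    "duped.1.4b": (730_820_104, 1_373_722),
    "duped.12b": (1_267_397_432, 2_382_326),
    "deduped.12b.23000": (86_938_376, 163_418),
}

# Integer division plus remainder shows whether the row size is exact.
bytes_per_row = {}
for name, (num_bytes, num_examples) in SPLITS.items():
    quotient, remainder = divmod(num_bytes, num_examples)
    bytes_per_row[name] = (quotient, remainder)
    print(f"{name}: {quotient} bytes per example (remainder {remainder})")

# Assumed payload: 64 tokens * 8 bytes + index + __index_level_0__ (8 bytes each).
assumed_payload = 64 * 8 + 8 + 8
print(f"assumed payload per row: {assumed_payload} bytes")
```

Every split checked here divides evenly, which is consistent with each row storing exactly 64 tokens.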

Dataset size

  • Download size: 4735823411 bytes.
  • Dataset size: 16713076324 bytes.