alvin319/semantic-memorization-partial-2023-09-03

Source: Hugging Face. Updated 2023-09-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/alvin319/semantic-memorization-partial-2023-09-03
Description:
This dataset is a partial computation of metrics (memorized token frequencies, non-memorized token frequencies, sequence frequencies) needed for [research](https://github.com/EleutherAI/semantic-memorization).
Provided by:
alvin319
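Summary of original information

Dataset overview

License

  • MIT License

Configuration

  • Default configuration (config_name: default)
    • Splits and data file paths: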
      • pile_deduped_70m: data/pile_deduped_70m-*
      • memories_deduped_70m: data/memories_deduped_70m-*
      • pile_deduped_160m: data/pile_deduped_160m-*
      • memories_deduped_160m: data/memories_deduped_160m-*
      • pile_deduped_410m: data/pile_deduped_410m-*
      • memories_deduped_410m: data/memories_deduped_410m-*
      • pile_deduped_1b: data/pile_deduped_1b-*
      • memories_deduped_1b: data/memories_deduped_1b-*
      • pile_deduped_1.4b: data/pile_deduped_1.4b-*
      • memories_deduped_1.4b: data/memories_deduped_1.4b-*
      • pile_deduped_2.8b: data/pile_deduped_2.8b-*
      • memories_deduped_2.8b: data/memories_deduped_2.8b-*
      • pile_deduped_6.9b: data/pile_deduped_6.9b-*
      • memories_deduped_6.9b: data/memories_deduped_6.9b-*
      • pile_deduped_12b: data/pile_deduped_12b-*
      • memories_deduped_12b: data/memories_deduped_12b-*
      • pile_duped_70m: data/pile_duped_70m-*
      • memories_duped_70m: data/memories_duped_70m-*
      • pile_duped_160m: data/pile_duped_160m-*
      • memories_duped_160m: data/memories_duped_160m-*
      • pile_duped_410m: data/pile_duped_410m-*
      • memories_duped_410m: data/memories_duped_410m-*
      • pile_duped_1b: data/pile_duped_1b-*
      • memories_duped_1b: data/memories_duped_1b-*
      • pile_duped_1.4b: data/pile_duped_1.4b-*
      • memories_duped_1.4b: data/memories_duped_1.4b-*
      • pile_duped_2.8b: data/pile_duped_2.8b-*
      • memories_duped_2.8b: data/memories_duped_2.8b-*
      • pile_duped_6.9b: data/pile_duped_6.9b-*
      • memories_duped_6.9b: data/memories_duped_6.9b-*
      • pile_duped_12b: data/pile_duped_12b-*
      • memories_duped_12b: data/memories_duped_12b-*
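
The split names above follow a regular pattern: a kind (pile or memories), a deduplication scheme (deduped or duped), and a model size. A minimal sketch of building those names and loading one split is given below. It assumes the third-party Hugging Face `datasets` library is installed and that the repository is reachable under the ID shown in the download link; the helper names `split_name`, `all_split_names`, and `load_split` are hypothetical and not part of the dataset.

```python
from itertools import product

# Repository ID taken from the download link; everything else is illustrative.
REPO_ID = "alvin319/semantic-memorization-partial-2023-09-03"
KINDS = ("pile", "memories")
SCHEMES = ("deduped", "duped")
SIZES = ("70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b")


def split_name(kind: str, scheme: str, size: str) -> str:
    """Build a split name such as 'memories_deduped_70m' (hypothetical helper)."""
    if kind not in KINDS or scheme not in SCHEMES or size not in SIZES:
        raise ValueError(f"unknown split component: {kind!r}, {scheme!r}, {size!r}")
    return f"{kind}_{scheme}_{size}"


def all_split_names() -> list:
    """List all 32 split names in the same order as the listing above."""
    return [split_name(k, s, z) for s, z, k in product(SCHEMES, SIZES, KINDS)]


def load_split(kind: str, scheme: str, size: str, streaming: bool = True):
    """Load one split; streaming avoids downloading the full split up front."""
    # Imported lazily so the name helpers work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_name(kind, scheme, size), streaming=streaming)


if __name__ == "__main__":
    for name in all_split_names():
        print(name)
```

Calling `load_split("memories", "deduped", "70m")` would then request the `memories_deduped_70m` split; whether streaming is the better choice depends on how much of the split is actually read.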

Dataset information

  • Features:

    • sequence_id: int64
    • tokens: sequence of int64
    • memorized_frequencies: sequence of int64
    • non_memorized_frequencies: sequence of int64
    • memorization_score: float64
    • sequence_frequency: int64
  • Splits:

    • pile_deduped_70m: 7860000000 bytes, 5000000 examples
    • memories_deduped_70m: 646796256 bytes, 411448 examples
    • pile_deduped_160m: 7860000000 bytes, 5000000 examples
    • memories_deduped_160m: 913638540 bytes, 581195 examples
    • pile_deduped_410m: 7860000000 bytes, 5000000 examples
    • memories_deduped_410m: 1274953308 bytes, 811039 examples
    • pile_deduped_1b: 7860000000 bytes, 5000000 examples
    • memories_deduped_1b: 1623663780 bytes, 1032865 examples
    • pile_deduped_1.4b: 7860000000 bytes, 5000000 examples
    • memories_deduped_1.4b: 1647608484 bytes, 1048097 examples
    • pile_deduped_2.8b: 7860000000 bytes, 5000000 examples
    • memories_deduped_2.8b: 2130391692 bytes, 1355211 examples
    • pile_deduped_6.9b: 7860000000 bytes, 5000000 examples
    • memories_deduped_6.9b: 2641422168 bytes, 1680294 examples
    • pile_deduped_12b: 7860000000 bytes, 5000000 examples
    • memories_deduped_12b: 2941549980 bytes, 1871215 examples
    • pile_duped_70m: 7860000000 bytes, 5000000 examples
    • memories_duped_70m: 729334116 bytes, 463953 examples
    • pile_duped_160m: 7860000000 bytes, 5000000 examples
    • memories_duped_160m: 1084165956 bytes, 689673 examples
    • pile_duped_410m: 7860000000 bytes, 5000000 examples
    • memories_duped_410m: 1525376052 bytes, 970341 examples
    • pile_duped_1b: 7860000000 bytes, 5000000 examples
    • memories_duped_1b: 1974653652 bytes, 1256141 examples
    • pile_duped_1.4b: 7860000000 bytes, 5000000 examples
    • memories_duped_1.4b: 2159490984 bytes, 1373722 examples
    • pile_duped_2.8b: 7860000000 bytes, 5000000 examples
    • memories_duped_2.8b: 2633221044 bytes, 1675077 examples
    • pile_duped_6.9b: 7860000000 bytes, 5000000 examples
    • memories_duped_6.9b: 3334163268 bytes, 2120969 examples
    • pile_duped_12b: 7860000000 bytes, 5000000 examples
    • memories_duped_12b: 3745016472 bytes, 2382326 examples
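
The six features listed in this section can be mirrored as a typed record, which is handy for checking rows before further processing. The sketch below uses only the Python standard library. The names `SequenceRecord` and `matches_schema` are hypothetical, the example values are made up, and the mapping of int64 and float64 to Python `int` and `float` is an assumption about how rows would appear once decoded.

```python
from typing import List, TypedDict


class SequenceRecord(TypedDict):
    """One row, mirroring the six listed features (class name is hypothetical)."""

    sequence_id: int
    tokens: List[int]
    memorized_frequencies: List[int]
    non_memorized_frequencies: List[int]
    memorization_score: float
    sequence_frequency: int


INT_FIELDS = ("sequence_id", "sequence_frequency")
INT_LIST_FIELDS = ("tokens", "memorized_frequencies", "non_memorized_frequencies")
FLOAT_FIELDS = ("memorization_score",)


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def matches_schema(row: dict) -> bool:
    """Return True if `row` has exactly the listed fields with compatible types."""
    if set(row) != set(INT_FIELDS + INT_LIST_FIELDS + FLOAT_FIELDS):
        return False
    if not all(_is_int(row[name]) for name in INT_FIELDS):
        return False
    if not all(isinstance(row[name], float) for name in FLOAT_FIELDS):
        return False
    return all(
        isinstance(row[name], list) and all(_is_int(v) for v in row[name])
        for name in INT_LIST_FIELDS
    )


# Made-up values for illustration only; they are not taken from the dataset.
example_row: SequenceRecord = {
    "sequence_id": 0,
    "tokens": [11, 22, 33],
    "memorized_frequencies": [5, 6, 7],
    "non_memorized_frequencies": [1, 2, 3],
    "memorization_score": 0.5,
    "sequence_frequency": 1,
}

print(matches_schema(example_row))
```

The check is deliberately strict about field names and loose about list lengths, since the listing does not say whether the three sequences share a length.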

Dataset size

  • Download size: 11256676441 bytes
  • Dataset size: 156765445752 bytes
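
The reported sizes can be cross-checked with a little arithmetic. The sketch below copies the per-split byte and example counts from the listing, sums the split sizes, and divides bytes by examples for each split. The variable and function names are hypothetical, and the ratio of dataset size to download size is only a rough indicator of on-disk compression, not a figure stated by the dataset card.

```python
# Figures copied from the split listing; names are illustrative.
PILE_SPLIT_BYTES = 7_860_000_000
PILE_SPLIT_EXAMPLES = 5_000_000
NUM_PILE_SPLITS = 16  # 8 model sizes x 2 deduplication schemes

# (num_bytes, num_examples) for each memories split.
MEMORIES_SPLITS = {
    "memories_deduped_70m": (646_796_256, 411_448),
    "memories_deduped_160m": (913_638_540, 581_195),
    "memories_deduped_410m": (1_274_953_308, 811_039),
    "memories_deduped_1b": (1_623_663_780, 1_032_865),
    "memories_deduped_1.4b": (1_647_608_484, 1_048_097),
    "memories_deduped_2.8b": (2_130_391_692, 1_355_211),
    "memories_deduped_6.9b": (2_641_422_168, 1_680_294),
    "memories_deduped_12b": (2_941_549_980, 1_871_215),
    "memories_duped_70m": (729_334_116, 463_953),
    "memories_duped_160m": (1_084_165_956, 689_673),
    "memories_duped_410m": (1_525_376_052, 970_341),
    "memories_duped_1b": (1_974_653_652, 1_256_141),
    "memories_duped_1.4b": (2_159_490_984, 1_373_722),
    "memories_duped_2.8b": (2_633_221_044, 1_675_077),
    "memories_duped_6.9b": (3_334_163_268, 2_120_969),
    "memories_duped_12b": (3_745_016_472, 2_382_326),
}

DOWNLOAD_SIZE = 11_256_676_441
DATASET_SIZE = 156_765_445_752


def bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average stored size of one example in a split."""
    return num_bytes / num_examples


# Sum of all 32 split sizes.
total_bytes = NUM_PILE_SPLITS * PILE_SPLIT_BYTES + sum(
    num_bytes for num_bytes, _ in MEMORIES_SPLITS.values()
)

# Distinct per-example sizes across the pile splits and every memories split.
per_example_sizes = {bytes_per_example(PILE_SPLIT_BYTES, PILE_SPLIT_EXAMPLES)} | {
    bytes_per_example(b, n) for b, n in MEMORIES_SPLITS.values()
}

print("sum of split sizes matches dataset size:", total_bytes == DATASET_SIZE)
print("distinct bytes per example:", sorted(per_example_sizes))
print("dataset size / download size:", round(DATASET_SIZE / DOWNLOAD_SIZE, 2))
```

With the listed figures, the split sizes add up to the reported dataset size, and every split works out to the same 1572 bytes per example, so the memories splits differ from the pile splits only in how many examples they hold.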