alvin319/semantic-memorization-partial-2023-09-03
Source: Hugging Face. Updated: 2023-09-04. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/alvin319/semantic-memorization-partial-2023-09-03
Description:
---
license: mit
configs:
- config_name: default
  data_files:
  - split: pile_deduped_70m
    path: data/pile_deduped_70m-*
  - split: memories_deduped_70m
    path: data/memories_deduped_70m-*
  - split: pile_deduped_160m
    path: data/pile_deduped_160m-*
  - split: memories_deduped_160m
    path: data/memories_deduped_160m-*
  - split: pile_deduped_410m
    path: data/pile_deduped_410m-*
  - split: memories_deduped_410m
    path: data/memories_deduped_410m-*
  - split: pile_deduped_1b
    path: data/pile_deduped_1b-*
  - split: memories_deduped_1b
    path: data/memories_deduped_1b-*
  - split: pile_deduped_1.4b
    path: data/pile_deduped_1.4b-*
  - split: memories_deduped_1.4b
    path: data/memories_deduped_1.4b-*
  - split: pile_deduped_2.8b
    path: data/pile_deduped_2.8b-*
  - split: memories_deduped_2.8b
    path: data/memories_deduped_2.8b-*
  - split: pile_deduped_6.9b
    path: data/pile_deduped_6.9b-*
  - split: memories_deduped_6.9b
    path: data/memories_deduped_6.9b-*
  - split: pile_deduped_12b
    path: data/pile_deduped_12b-*
  - split: memories_deduped_12b
    path: data/memories_deduped_12b-*
  - split: pile_duped_70m
    path: data/pile_duped_70m-*
  - split: memories_duped_70m
    path: data/memories_duped_70m-*
  - split: pile_duped_160m
    path: data/pile_duped_160m-*
  - split: memories_duped_160m
    path: data/memories_duped_160m-*
  - split: pile_duped_410m
    path: data/pile_duped_410m-*
  - split: memories_duped_410m
    path: data/memories_duped_410m-*
  - split: pile_duped_1b
    path: data/pile_duped_1b-*
  - split: memories_duped_1b
    path: data/memories_duped_1b-*
  - split: pile_duped_1.4b
    path: data/pile_duped_1.4b-*
  - split: memories_duped_1.4b
    path: data/memories_duped_1.4b-*
  - split: pile_duped_2.8b
    path: data/pile_duped_2.8b-*
  - split: memories_duped_2.8b
    path: data/memories_duped_2.8b-*
  - split: pile_duped_6.9b
    path: data/pile_duped_6.9b-*
  - split: memories_duped_6.9b
    path: data/memories_duped_6.9b-*
  - split: pile_duped_12b
    path: data/pile_duped_12b-*
  - split: memories_duped_12b
    path: data/memories_duped_12b-*
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: tokens
    sequence: int64
  - name: memorized_frequencies
    sequence: int64
  - name: non_memorized_frequencies
    sequence: int64
  - name: memorization_score
    dtype: float64
  - name: sequence_frequency
    dtype: int64
  splits:
  - name: pile_deduped_70m
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_70m
    num_bytes: 646796256
    num_examples: 411448
  - name: pile_deduped_160m
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_160m
    num_bytes: 913638540
    num_examples: 581195
  - name: pile_deduped_410m
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_410m
    num_bytes: 1274953308
    num_examples: 811039
  - name: pile_deduped_1b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_1b
    num_bytes: 1623663780
    num_examples: 1032865
  - name: pile_deduped_1.4b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_1.4b
    num_bytes: 1647608484
    num_examples: 1048097
  - name: pile_deduped_2.8b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_2.8b
    num_bytes: 2130391692
    num_examples: 1355211
  - name: pile_deduped_6.9b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_6.9b
    num_bytes: 2641422168
    num_examples: 1680294
  - name: pile_deduped_12b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_deduped_12b
    num_bytes: 2941549980
    num_examples: 1871215
  - name: pile_duped_70m
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_70m
    num_bytes: 729334116
    num_examples: 463953
  - name: pile_duped_160m
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_160m
    num_bytes: 1084165956
    num_examples: 689673
  - name: pile_duped_410m
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_410m
    num_bytes: 1525376052
    num_examples: 970341
  - name: pile_duped_1b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_1b
    num_bytes: 1974653652
    num_examples: 1256141
  - name: pile_duped_1.4b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_1.4b
    num_bytes: 2159490984
    num_examples: 1373722
  - name: pile_duped_2.8b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_2.8b
    num_bytes: 2633221044
    num_examples: 1675077
  - name: pile_duped_6.9b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_6.9b
    num_bytes: 3334163268
    num_examples: 2120969
  - name: pile_duped_12b
    num_bytes: 7860000000
    num_examples: 5000000
  - name: memories_duped_12b
    num_bytes: 3745016472
    num_examples: 2382326
  download_size: 11256676441
  dataset_size: 156765445752
---
This dataset contains a partial computation of the metrics (memorized token frequencies, non-memorized token frequencies, and sequence frequencies) needed for the EleutherAI [semantic-memorization research](https://github.com/EleutherAI/semantic-memorization).
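
As a minimal sketch of how one of these splits might be read, the snippet below builds a split name from the naming pattern used in the configuration (`pile` or `memories`, then `deduped` or `duped`, then a model size) and passes it to `load_dataset` from the Hugging Face `datasets` library. The helper names `split_name`, `all_split_names`, and `load_split` are illustrative and not part of the dataset, and streaming is chosen here only as an assumption to avoid downloading a whole split up front.

```python
"""Sketch: build split names for this dataset and load one lazily."""

REPO_ID = "alvin319/semantic-memorization-partial-2023-09-03"
MODEL_SIZES = ("70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b")
KINDS = ("pile", "memories")
DEDUP_MODES = ("deduped", "duped")


def split_name(kind: str, dedup: str, size: str) -> str:
    """Return a split name such as 'memories_deduped_70m' (illustrative helper)."""
    if kind not in KINDS or dedup not in DEDUP_MODES or size not in MODEL_SIZES:
        raise ValueError(f"unknown split components: {kind!r}, {dedup!r}, {size!r}")
    return f"{kind}_{dedup}_{size}"


def all_split_names() -> list[str]:
    """List the 32 split names in the same order as the configuration."""
    return [
        split_name(kind, dedup, size)
        for dedup in DEDUP_MODES
        for size in MODEL_SIZES
        for kind in KINDS
    ]


def load_split(kind: str, dedup: str, size: str):
    """Stream one split; needs `pip install datasets` and network access."""
    # Imported lazily so the name helpers above work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_name(kind, dedup, size), streaming=True)


# Example use (not executed here because it needs network access):
#   rows = load_split("memories", "deduped", "70m")
#   first = next(iter(rows))
#   print(first["sequence_id"], first["memorization_score"])
```

Each `pile_*` split is listed with 5,000,000 examples, so streaming or requesting a single split is likely the more practical starting point than loading every split at once.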
Provider:
alvin319
Summary of original information
Dataset overview
License
- MIT License
Configuration
- Default configuration
- Data file paths and splits:
  - pile_deduped_70m: data/pile_deduped_70m-*
  - memories_deduped_70m: data/memories_deduped_70m-*
  - pile_deduped_160m: data/pile_deduped_160m-*
  - memories_deduped_160m: data/memories_deduped_160m-*
  - pile_deduped_410m: data/pile_deduped_410m-*
  - memories_deduped_410m: data/memories_deduped_410m-*
  - pile_deduped_1b: data/pile_deduped_1b-*
  - memories_deduped_1b: data/memories_deduped_1b-*
  - pile_deduped_1.4b: data/pile_deduped_1.4b-*
  - memories_deduped_1.4b: data/memories_deduped_1.4b-*
  - pile_deduped_2.8b: data/pile_deduped_2.8b-*
  - memories_deduped_2.8b: data/memories_deduped_2.8b-*
  - pile_deduped_6.9b: data/pile_deduped_6.9b-*
  - memories_deduped_6.9b: data/memories_deduped_6.9b-*
  - pile_deduped_12b: data/pile_deduped_12b-*
  - memories_deduped_12b: data/memories_deduped_12b-*
  - pile_duped_70m: data/pile_duped_70m-*
  - memories_duped_70m: data/memories_duped_70m-*
  - pile_duped_160m: data/pile_duped_160m-*
  - memories_duped_160m: data/memories_duped_160m-*
  - pile_duped_410m: data/pile_duped_410m-*
  - memories_duped_410m: data/memories_duped_410m-*
  - pile_duped_1b: data/pile_duped_1b-*
  - memories_duped_1b: data/memories_duped_1b-*
  - pile_duped_1.4b: data/pile_duped_1.4b-*
  - memories_duped_1.4b: data/memories_duped_1.4b-*
  - pile_duped_2.8b: data/pile_duped_2.8b-*
  - memories_duped_2.8b: data/memories_duped_2.8b-*
  - pile_duped_6.9b: data/pile_duped_6.9b-*
  - memories_duped_6.9b: data/memories_duped_6.9b-*
  - pile_duped_12b: data/pile_duped_12b-*
  - memories_duped_12b: data/memories_duped_12b-*
Dataset information
- Features:
  - sequence_id: int64
  - tokens: sequence of int64
  - memorized_frequencies: sequence of int64
  - non_memorized_frequencies: sequence of int64
  - memorization_score: float64
  - sequence_frequency: int64
- Splits:
  - pile_deduped_70m: 7860000000 bytes, 5000000 examples
  - memories_deduped_70m: 646796256 bytes, 411448 examples
  - pile_deduped_160m: 7860000000 bytes, 5000000 examples
  - memories_deduped_160m: 913638540 bytes, 581195 examples
  - pile_deduped_410m: 7860000000 bytes, 5000000 examples
  - memories_deduped_410m: 1274953308 bytes, 811039 examples
  - pile_deduped_1b: 7860000000 bytes, 5000000 examples
  - memories_deduped_1b: 1623663780 bytes, 1032865 examples
  - pile_deduped_1.4b: 7860000000 bytes, 5000000 examples
  - memories_deduped_1.4b: 1647608484 bytes, 1048097 examples
  - pile_deduped_2.8b: 7860000000 bytes, 5000000 examples
  - memories_deduped_2.8b: 2130391692 bytes, 1355211 examples
  - pile_deduped_6.9b: 7860000000 bytes, 5000000 examples
  - memories_deduped_6.9b: 2641422168 bytes, 1680294 examples
  - pile_deduped_12b: 7860000000 bytes, 5000000 examples
  - memories_deduped_12b: 2941549980 bytes, 1871215 examples
  - pile_duped_70m: 7860000000 bytes, 5000000 examples
  - memories_duped_70m: 729334116 bytes, 463953 examples
  - pile_duped_160m: 7860000000 bytes, 5000000 examples
  - memories_duped_160m: 1084165956 bytes, 689673 examples
  - pile_duped_410m: 7860000000 bytes, 5000000 examples
  - memories_duped_410m: 1525376052 bytes, 970341 examples
  - pile_duped_1b: 7860000000 bytes, 5000000 examples
  - memories_duped_1b: 1974653652 bytes, 1256141 examples
  - pile_duped_1.4b: 7860000000 bytes, 5000000 examples
  - memories_duped_1.4b: 2159490984 bytes, 1373722 examples
  - pile_duped_2.8b: 7860000000 bytes, 5000000 examples
  - memories_duped_2.8b: 2633221044 bytes, 1675077 examples
  - pile_duped_6.9b: 7860000000 bytes, 5000000 examples
  - memories_duped_6.9b: 3334163268 bytes, 2120969 examples
  - pile_duped_12b: 7860000000 bytes, 5000000 examples
  - memories_duped_12b: 3745016472 bytes, 2382326 examples
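
The six features listed for every split can be turned into a small schema check. The following is only a sketch: the `FEATURES` table mirrors the names and types from the metadata, while `validate_row` and the example row values are hypothetical and chosen purely for illustration.

```python
"""Sketch: check that a row matches the feature list from the metadata."""

# Feature name -> (container, element type); None marks a scalar value.
FEATURES = {
    "sequence_id": (None, int),
    "tokens": (list, int),
    "memorized_frequencies": (list, int),
    "non_memorized_frequencies": (list, int),
    "memorization_score": (None, float),
    "sequence_frequency": (None, int),
}


def _is_type(value, expected) -> bool:
    """Type check that rejects bool, which Python treats as a subclass of int."""
    return not isinstance(value, bool) and isinstance(value, expected)


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for name, (container, elem_type) in FEATURES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        if container is list:
            if not isinstance(value, list):
                problems.append(f"{name}: expected a list")
            elif not all(_is_type(v, elem_type) for v in value):
                problems.append(f"{name}: expected {elem_type.__name__} elements")
        elif not _is_type(value, elem_type):
            problems.append(f"{name}: expected {elem_type.__name__}")
    return problems


# Hypothetical row; the values are made up and do not come from the dataset.
example_row = {
    "sequence_id": 0,
    "tokens": [1, 2, 3],
    "memorized_frequencies": [10, 20, 30],
    "non_memorized_frequencies": [5, 6, 7],
    "memorization_score": 1.0,
    "sequence_frequency": 1,
}
print(validate_row(example_row))  # []
```

A check like this only covers field names and types; the metadata does not state sequence lengths or value ranges, so none are enforced here.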
Dataset size
- Download size: 11256676441 bytes
- Dataset size: 156765445752 bytes
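
The reported sizes can be cross-checked with a little arithmetic. The sketch below copies the example counts from the split metadata, derives a per-example size from the `pile_*` splits, and applies it to the `memories_*` splits; that the same per-example size holds for every split is an assumption of this sketch, which the reported total is then used to test. The variable names are illustrative only.

```python
"""Sketch: cross-check the split sizes reported in the metadata."""

PILE_SPLITS = 16            # 8 model sizes x {deduped, duped}
PILE_EXAMPLES = 5_000_000   # num_examples of every pile_* split
PILE_BYTES = 7_860_000_000  # num_bytes of every pile_* split

# num_examples of the memories_* splits, copied from the metadata.
MEMORIES_EXAMPLES = {
    "memories_deduped_70m": 411_448,
    "memories_deduped_160m": 581_195,
    "memories_deduped_410m": 811_039,
    "memories_deduped_1b": 1_032_865,
    "memories_deduped_1.4b": 1_048_097,
    "memories_deduped_2.8b": 1_355_211,
    "memories_deduped_6.9b": 1_680_294,
    "memories_deduped_12b": 1_871_215,
    "memories_duped_70m": 463_953,
    "memories_duped_160m": 689_673,
    "memories_duped_410m": 970_341,
    "memories_duped_1b": 1_256_141,
    "memories_duped_1.4b": 1_373_722,
    "memories_duped_2.8b": 1_675_077,
    "memories_duped_6.9b": 2_120_969,
    "memories_duped_12b": 2_382_326,
}

DOWNLOAD_SIZE = 11_256_676_441
DATASET_SIZE = 156_765_445_752

# Size of one example, derived from the pile_* splits.
bytes_per_example = PILE_BYTES // PILE_EXAMPLES

# Assumption: memories_* examples occupy the same number of bytes each.
memories_bytes = sum(n * bytes_per_example for n in MEMORIES_EXAMPLES.values())
total_bytes = PILE_SPLITS * PILE_BYTES + memories_bytes

print(bytes_per_example)                       # 1572
print(total_bytes == DATASET_SIZE)             # True
print(round(DOWNLOAD_SIZE / DATASET_SIZE, 3))  # 0.072
```

Under these figures every example takes 1572 bytes, the per-split sizes add up to the reported dataset size, and the download is roughly 7% of the uncompressed size.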



