usvsnsp/generation-semantic-memorization-filters
Hugging Face. Updated 2024-01-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/usvsnsp/generation-semantic-memorization-filters
Resource description:
---
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: text
    dtype: string
  - name: is_incrementing
    dtype: bool
  - name: sequence_duplicates
    dtype: int64
  - name: max_frequency
    dtype: int64
  - name: avg_frequency
    dtype: float64
  - name: min_frequency
    dtype: int64
  - name: median_frequency
    dtype: float64
  - name: p25_frequency
    dtype: int64
  - name: p75_frequency
    dtype: int64
  - name: frequencies
    sequence: int64
  - name: tokens
    sequence: int64
  - name: repeating_offset
    dtype: int32
  - name: num_repeating
    dtype: int32
  - name: smallest_repeating_chunk
    sequence: int64
  - name: nl_scores
    dtype: float32
  - name: 0_8_snowclones
    dtype: int64
  - name: 0_9_snowclones
    dtype: int64
  - name: 0_8_templates
    dtype: int64
  - name: 0_9_templates
    dtype: int64
  - name: huffman_coding_length
    dtype: float64
  - name: memorization_score
    dtype: float64
  - name: generation_perplexity
    dtype: float64
  splits:
  - name: pile_duped_410m
    num_bytes: 7123348430
    num_examples: 5000000
  - name: memories_deduped_160m
    num_bytes: 823500869
    num_examples: 581200
  - name: pile_deduped_2.8b
    num_bytes: 7134034756
    num_examples: 5000000
  - name: pile_deduped_1b
    num_bytes: 7133409756
    num_examples: 5000000
  - name: memories_duped_2.8b
    num_bytes: 2333155652
    num_examples: 1675078
  - name: pile_duped_1.4b
    num_bytes: 7123348430
    num_examples: 5000000
  - name: pile_deduped_12b
    num_bytes: 7133409756
    num_examples: 5000000
  - name: memories_deduped_1.4b
    num_bytes: 1475622003
    num_examples: 1048097
  - name: memories_duped_410m
    num_bytes: 1360848266
    num_examples: 970344
  - name: pile_duped_2.8b
    num_bytes: 7123348430
    num_examples: 5000000
  - name: memories_deduped_410m
    num_bytes: 1145472499
    num_examples: 811040
  - name: memories_deduped_1b
    num_bytes: 1455198363
    num_examples: 1032872
  - name: memories_duped_12b
    num_bytes: 3309825556
    num_examples: 2382328
  - name: memories_deduped_70m
    num_bytes: 586086312
    num_examples: 411448
  - name: pile_duped_1b
    num_bytes: 7123348430
    num_examples: 5000000
  - name: pile_deduped_1.4b
    num_bytes: 7134034756
    num_examples: 5000000
  - name: pile_deduped_70m
    num_bytes: 7133409756
    num_examples: 5000000
  - name: pile_duped_6.9b
    num_bytes: 7123348430
    num_examples: 5000000
  - name: memories_duped_70m
    num_bytes: 659286070
    num_examples: 463960
  - name: pile_duped_12b
    num_bytes: 7123348430
    num_examples: 5000000
  - name: memories_deduped_6.9b
    num_bytes: 2359576889
    num_examples: 1680294
  - name: pile_duped_160m
    num_bytes: 7123348430
    num_examples: 5000000
  - name: memories_duped_6.9b
    num_bytes: 2949197348
    num_examples: 2120971
  - name: pile_deduped_410m
    num_bytes: 7133409756
    num_examples: 5000000
  - name: pile_deduped_160m
    num_bytes: 7133409756
    num_examples: 5000000
  - name: memories_duped_1b
    num_bytes: 1755692131
    num_examples: 1256144
  - name: pile_deduped_6.9b
    num_bytes: 7134034756
    num_examples: 5000000
  - name: pile_duped_70m
    num_bytes: 7123348430
    num_examples: 5000000
  - name: memories_deduped_12b
    num_bytes: 2626258620
    num_examples: 1871216
  - name: memories_deduped_2.8b
    num_bytes: 1904601881
    num_examples: 1355211
  - name: memories_duped_1.4b
    num_bytes: 1917661495
    num_examples: 1373723
  - name: memories_duped_160m
    num_bytes: 972466846
    num_examples: 689680
  download_size: 58080323885
  dataset_size: 141690391288
configs:
- config_name: default
  data_files:
  - split: pile_duped_410m
    path: data/pile_duped_410m-*
  - split: memories_deduped_160m
    path: data/memories_deduped_160m-*
  - split: pile_deduped_2.8b
    path: data/pile_deduped_2.8b-*
  - split: pile_deduped_1b
    path: data/pile_deduped_1b-*
  - split: memories_duped_2.8b
    path: data/memories_duped_2.8b-*
  - split: pile_duped_1.4b
    path: data/pile_duped_1.4b-*
  - split: pile_deduped_12b
    path: data/pile_deduped_12b-*
  - split: memories_deduped_1.4b
    path: data/memories_deduped_1.4b-*
  - split: memories_duped_410m
    path: data/memories_duped_410m-*
  - split: pile_duped_2.8b
    path: data/pile_duped_2.8b-*
  - split: memories_deduped_410m
    path: data/memories_deduped_410m-*
  - split: memories_deduped_1b
    path: data/memories_deduped_1b-*
  - split: memories_duped_12b
    path: data/memories_duped_12b-*
  - split: memories_deduped_70m
    path: data/memories_deduped_70m-*
  - split: pile_duped_1b
    path: data/pile_duped_1b-*
  - split: pile_deduped_1.4b
    path: data/pile_deduped_1.4b-*
  - split: pile_deduped_70m
    path: data/pile_deduped_70m-*
  - split: pile_duped_6.9b
    path: data/pile_duped_6.9b-*
  - split: memories_duped_70m
    path: data/memories_duped_70m-*
  - split: pile_duped_12b
    path: data/pile_duped_12b-*
  - split: memories_deduped_6.9b
    path: data/memories_deduped_6.9b-*
  - split: pile_duped_160m
    path: data/pile_duped_160m-*
  - split: memories_duped_6.9b
    path: data/memories_duped_6.9b-*
  - split: pile_deduped_410m
    path: data/pile_deduped_410m-*
  - split: pile_deduped_160m
    path: data/pile_deduped_160m-*
  - split: memories_duped_1b
    path: data/memories_duped_1b-*
  - split: pile_deduped_6.9b
    path: data/pile_deduped_6.9b-*
  - split: pile_duped_70m
    path: data/pile_duped_70m-*
  - split: memories_deduped_12b
    path: data/memories_deduped_12b-*
  - split: memories_deduped_2.8b
    path: data/memories_deduped_2.8b-*
  - split: memories_duped_1.4b
    path: data/memories_duped_1.4b-*
  - split: memories_duped_160m
    path: data/memories_duped_160m-*
---
Provider:
usvsnsp
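A minimal sketch of reading a few rows from one split, assuming the repository is reachable on the Hugging Face Hub under the ID that appears in the download link and that the third-party `datasets` library is installed. The helper name `stream_rows` is illustrative and not part of the dataset. Streaming is used because the metadata lists a download size of roughly 58 GB.

```python
REPO_ID = "usvsnsp/generation-semantic-memorization-filters"


def stream_rows(split: str, limit: int = 3) -> list:
    """Return the first `limit` rows of one split without a full download."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset  # third-party: pip install datasets

    # streaming=True yields an iterable dataset that fetches rows on demand.
    dataset = load_dataset(REPO_ID, split=split, streaming=True)
    return list(dataset.take(limit))


# Example call (needs network access):
# rows = stream_rows("memories_deduped_70m")
# print(rows[0]["sequence_id"], rows[0]["memorization_score"])
```

Any of the 32 split names from the metadata can be passed as `split`; the smaller `memories_*` splits are probably the quickest to try first.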
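Summary of original information
Dataset overview
Dataset features
The dataset contains the following features:
- sequence_id: integer; the sequence ID.
- text: string; the text content.
- is_incrementing: boolean; whether the sequence is incrementing.
- sequence_duplicates: integer; the number of duplicates of the sequence.
- max_frequency: integer; the maximum frequency.
- avg_frequency: float; the average frequency.
- min_frequency: integer; the minimum frequency.
- median_frequency: float; the median frequency.
- p25_frequency: integer; the 25th-percentile frequency.
- p75_frequency: integer; the 75th-percentile frequency.
- frequencies: sequence of integers; the frequencies.
- tokens: sequence of integers; the tokens.
- repeating_offset: integer; the repeating offset.
- num_repeating: integer; the number of repetitions.
- smallest_repeating_chunk: sequence of integers; the smallest repeating chunk.
- nl_scores: float; the natural-language score.
- 0_8_snowclones: integer; snowclones for the 0.8 variant.
- 0_9_snowclones: integer; snowclones for the 0.9 variant.
- 0_8_templates: integer; templates for the 0.8 variant.
- 0_9_templates: integer; templates for the 0.9 variant.
- huffman_coding_length: float; the Huffman coding length.
- memorization_score: float; the memorization score.
- generation_perplexity: float; the generation perplexity.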
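The feature list can be turned into a small schema check for rows that have already been loaded as Python dictionaries. This is a sketch under stated assumptions: the mapping from declared dtypes to Python types (`int64`/`int32` to `int`, `float64`/`float32` to `float`, `sequence` to `list`) is a simplification, and `schema_problems` and `placeholder_row` are hypothetical helper names. Note that four feature names start with a digit, so they can only be used as dictionary keys, not as attribute names.

```python
# Python-side kinds for each feature named in the dataset metadata.
EXPECTED_FEATURES = {
    "sequence_id": int,
    "text": str,
    "is_incrementing": bool,
    "sequence_duplicates": int,
    "max_frequency": int,
    "avg_frequency": float,
    "min_frequency": int,
    "median_frequency": float,
    "p25_frequency": int,
    "p75_frequency": int,
    "frequencies": list,
    "tokens": list,
    "repeating_offset": int,
    "num_repeating": int,
    "smallest_repeating_chunk": list,
    "nl_scores": float,
    "0_8_snowclones": int,
    "0_9_snowclones": int,
    "0_8_templates": int,
    "0_9_templates": int,
    "huffman_coding_length": float,
    "memorization_score": float,
    "generation_perplexity": float,
}


def matches_kind(value, kind) -> bool:
    """Check one value, treating bool as distinct from int."""
    if kind is bool:
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is a subclass of int, so reject it explicitly
    if kind is float:
        return isinstance(value, (int, float))
    return isinstance(value, kind)


def schema_problems(row: dict) -> list:
    """List missing or mistyped features, in schema order."""
    problems = []
    for name, kind in EXPECTED_FEATURES.items():
        if name not in row:
            problems.append(f"missing: {name}")
        elif not matches_kind(row[name], kind):
            problems.append(f"wrong type: {name}")
    return problems


def placeholder_row() -> dict:
    """Build a made-up row with neutral values, for demonstration only."""
    defaults = {int: 0, str: "", bool: False, float: 0.0}
    return {
        name: ([] if kind is list else defaults[kind])
        for name, kind in EXPECTED_FEATURES.items()
    }
```

An empty result from `schema_problems(row)` means every listed feature is present with a compatible Python type; it says nothing about whether the values themselves are plausible.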
Dataset splits
The dataset contains the following splits:
- pile_duped_410m: 7123348430 bytes, 5000000 examples.
- memories_deduped_160m: 823500869 bytes, 581200 examples.
- pile_deduped_2.8b: 7134034756 bytes, 5000000 examples.
- pile_deduped_1b: 7133409756 bytes, 5000000 examples.
- memories_duped_2.8b: 2333155652 bytes, 1675078 examples.
- pile_duped_1.4b: 7123348430 bytes, 5000000 examples.
- pile_deduped_12b: 7133409756 bytes, 5000000 examples.
- memories_deduped_1.4b: 1475622003 bytes, 1048097 examples.
- memories_duped_410m: 1360848266 bytes, 970344 examples.
- pile_duped_2.8b: 7123348430 bytes, 5000000 examples.
- memories_deduped_410m: 1145472499 bytes, 811040 examples.
- memories_deduped_1b: 1455198363 bytes, 1032872 examples.
- memories_duped_12b: 3309825556 bytes, 2382328 examples.
- memories_deduped_70m: 586086312 bytes, 411448 examples.
- pile_duped_1b: 7123348430 bytes, 5000000 examples.
- pile_deduped_1.4b: 7134034756 bytes, 5000000 examples.
- pile_deduped_70m: 7133409756 bytes, 5000000 examples.
- pile_duped_6.9b: 7123348430 bytes, 5000000 examples.
- memories_duped_70m: 659286070 bytes, 463960 examples.
- pile_duped_12b: 7123348430 bytes, 5000000 examples.
- memories_deduped_6.9b: 2359576889 bytes, 1680294 examples.
- pile_duped_160m: 7123348430 bytes, 5000000 examples.
- memories_duped_6.9b: 2949197348 bytes, 2120971 examples.
- pile_deduped_410m: 7133409756 bytes, 5000000 examples.
- pile_deduped_160m: 7133409756 bytes, 5000000 examples.
- memories_duped_1b: 1755692131 bytes, 1256144 examples.
- pile_deduped_6.9b: 7134034756 bytes, 5000000 examples.
- pile_duped_70m: 7123348430 bytes, 5000000 examples.
- memories_deduped_12b: 2626258620 bytes, 1871216 examples.
- memories_deduped_2.8b: 1904601881 bytes, 1355211 examples.
- memories_duped_1.4b: 1917661495 bytes, 1373723 examples.
- memories_duped_160m: 972466846 bytes, 689680 examples.
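The split names follow a regular three-part pattern, which the sketch below parses and regenerates. The reading of the parts is an assumption: the first part is taken as the sample source (`pile` or `memories`), the second as the duplicated or deduplicated variant, and the third as what looks like a model size. The names `parse_split` and `all_split_names` are hypothetical.

```python
import re
from itertools import product

SOURCES = ("pile", "memories")
VARIANTS = ("duped", "deduped")
SIZES = ("70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b")

# <source>_<variant>_<size>, where size is a number followed by "m" or "b".
SPLIT_PATTERN = re.compile(r"(pile|memories)_(duped|deduped)_(\d+(?:\.\d+)?[mb])")


def parse_split(name: str) -> tuple:
    """Split a name such as 'memories_deduped_1.4b' into its three parts."""
    match = SPLIT_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized split name: {name!r}")
    return match.groups()


def all_split_names() -> list:
    """Enumerate every source/variant/size combination."""
    return [f"{s}_{v}_{z}" for s, v, z in product(SOURCES, VARIANTS, SIZES)]


print(parse_split("memories_deduped_1.4b"))  # ('memories', 'deduped', '1.4b')
print(len(all_split_names()))                # 32
```

Two sources, two variants, and eight sizes give 32 combinations, which matches the number of splits listed in the metadata.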
Dataset size
- Download size: 58080323885 bytes
- Dataset size: 141690391288 bytes
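The raw byte counts are easier to judge after conversion. The short sketch below converts both figures to decimal gigabytes and binary gibibytes and computes their ratio. It assumes the usual Hugging Face convention, in which the download size refers to the stored files and the dataset size to the data once loaded; the source itself does not spell this out.

```python
DOWNLOAD_SIZE_BYTES = 58080323885
DATASET_SIZE_BYTES = 141690391288


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 10**9


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


for label, size in (("download", DOWNLOAD_SIZE_BYTES), ("dataset", DATASET_SIZE_BYTES)):
    print(f"{label}: {to_gb(size):.2f} GB ({to_gib(size):.2f} GiB)")
# download: 58.08 GB (54.09 GiB)
# dataset: 141.69 GB (131.96 GiB)

ratio = DATASET_SIZE_BYTES / DOWNLOAD_SIZE_BYTES
print(f"dataset/download ratio: {ratio:.2f}")  # 2.44
```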
Configuration
- Config name: default
- Data file paths:
  - pile_duped_410m: data/pile_duped_410m-*
  - memories_deduped_160m: data/memories_deduped_160m-*
  - pile_deduped_2.8b: data/pile_deduped_2.8b-*
  - pile_deduped_1b: data/pile_deduped_1b-*
  - memories_duped_2.8b: data/memories_duped_2.8b-*
  - pile_duped_1.4b: data/pile_duped_1.4b-*
  - pile_deduped_12b: data/pile_deduped_12b-*
  - memories_deduped_1.4b: data/memories_deduped_1.4b-*
  - memories_duped_410m: data/memories_duped_410m-*
  - pile_duped_2.8b: data/pile_duped_2.8b-*
  - memories_deduped_410m: data/memories_deduped_410m-*
  - memories_deduped_1b: data/memories_deduped_1b-*
  - memories_duped_12b: data/memories_duped_12b-*
  - memories_deduped_70m: data/memories_deduped_70m-*
  - pile_duped_1b: data/pile_duped_1b-*
  - pile_deduped_1.4b: data/pile_deduped_1.4b-*
  - pile_deduped_70m: data/pile_deduped_70m-*
  - pile_duped_6.9b: data/pile_duped_6.9b-*
  - memories_duped_70m: data/memories_duped_70m-*
  - pile_duped_12b: data/pile_duped_12b-*
  - memories_deduped_6.9b: data/memories_deduped_6.9b-*
  - pile_duped_160m: data/pile_duped_160m-*
  - memories_duped_6.9b: data/memories_duped_6.9b-*
  - pile_deduped_410m: data/pile_deduped_410m-*
  - pile_deduped_160m: data/pile_deduped_160m-*
  - memories_duped_1b: data/memories_duped_1b-*
  - pile_deduped_6.9b: data/pile_deduped_6.9b-*
  - pile_duped_70m: data/pile_duped_70m-*
  - memories_deduped_12b: data/memories_deduped_12b-*
  - memories_deduped_2.8b: data/memories_deduped_2.8b-*
  - memories_duped_1.4b: data/memories_duped_1.4b-*
  - memories_duped_160m: data/memories_duped_160m-*
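Each split maps to a glob pattern of the form `data/<split>-*`. The sketch below shows how such a pattern selects files, using the standard-library `fnmatch` module as an approximation; the Hub's own path resolution may differ in detail. The shard file names are hypothetical, since the source does not list the actual files.

```python
from fnmatch import fnmatchcase


def pattern_for(split: str) -> str:
    """Build the glob pattern that the config lists for a split."""
    return f"data/{split}-*"


def files_for_split(split: str, files: list) -> list:
    """Keep only the paths that match the split's pattern."""
    pattern = pattern_for(split)
    return [path for path in files if fnmatchcase(path, pattern)]


# Hypothetical shard names; the real file names are not given in the source.
EXAMPLE_FILES = [
    "data/pile_duped_1b-00000-of-00002.parquet",
    "data/pile_duped_1b-00001-of-00002.parquet",
    "data/pile_duped_1.4b-00000-of-00002.parquet",
    "data/memories_duped_1b-00000-of-00001.parquet",
]

print(files_for_split("pile_duped_1b", EXAMPLE_FILES))
# ['data/pile_duped_1b-00000-of-00002.parquet',
#  'data/pile_duped_1b-00001-of-00002.parquet']
```

The hyphen before the wildcard matters: it keeps the `pile_duped_1b` pattern from also picking up `pile_duped_1.4b` files.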



