usvsnsp/semantic-filters
Source: Hugging Face. Updated 2024-06-02; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/usvsnsp/semantic-filters
Resource description:
---
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: loss
    dtype: float32
  - name: prompt_perplexity
    dtype: float32
  - name: generation_perplexity
    dtype: float32
  - name: sequence_perplexity
    dtype: float32
  - name: text
    dtype: string
  - name: is_incrementing
    dtype: bool
  - name: is_repeating
    dtype: bool
  - name: sequence_duplicates
    dtype: int64
  - name: max_frequency
    dtype: int64
  - name: avg_frequency
    dtype: float64
  - name: min_frequency
    dtype: int64
  - name: median_frequency
    dtype: float64
  - name: p25_frequency
    dtype: int64
  - name: p75_frequency
    dtype: int64
  - name: frequencies
    sequence: int64
  - name: tokens
    sequence: int64
  - name: repeating_offset
    dtype: int32
  - name: num_repeating
    dtype: int32
  - name: smallest_repeating_chunk
    sequence: int64
  - name: nl_scores
    dtype: float32
  - name: 0_8_snowclones
    dtype: int64
  - name: 0_9_snowclones
    dtype: int64
  - name: 0_8_templates
    dtype: int64
  - name: 0_9_templates
    dtype: int64
  - name: huffman_coding_length
    dtype: float64
  - name: memorization_score
    dtype: float64
  splits:
  - name: pile_duped_6.9b
    num_bytes: 7163973430
    num_examples: 5000000
  - name: memories_duped_1b
    num_bytes: 1765898301
    num_examples: 1256144
  - name: pile_deduped_1.4b
    num_bytes: 7174034756
    num_examples: 5000000
  - name: memories_duped_6.9b
    num_bytes: 2966172012
    num_examples: 2120976
  - name: pile_duped_2.8b
    num_bytes: 7163973430
    num_examples: 5000000
  - name: memories_deduped_410m
    num_bytes: 1152062199
    num_examples: 811040
  - name: memories_deduped_1.4b
    num_bytes: 1484016295
    num_examples: 1048104
  - name: memories_duped_12b
    num_bytes: 3329181971
    num_examples: 2382328
  - name: memories_deduped_6.9b
    num_bytes: 2373022100
    num_examples: 1680296
  - name: pile_duped_160m
    num_bytes: 7166473430
    num_examples: 5000000
  - name: memories_deduped_1b
    num_bytes: 1463590448
    num_examples: 1032872
  - name: memories_duped_160m
    num_bytes: 978415336
    num_examples: 689680
  - name: pile_deduped_410m
    num_bytes: 7174034756
    num_examples: 5000000
  - name: pile_deduped_2.8b
    num_bytes: 7174034756
    num_examples: 5000000
  - name: pile_deduped_160m
    num_bytes: 7174034756
    num_examples: 5000000
  - name: memories_duped_1.4b
    num_bytes: 1928658249
    num_examples: 1373728
  - name: pile_duped_1.4b
    num_bytes: 7163973430
    num_examples: 5000000
  - name: memories_duped_70m
    num_bytes: 663287725
    num_examples: 463960
  - name: pile_duped_1b
    num_bytes: 7163973430
    num_examples: 5000000
  - name: memories_duped_410m
    num_bytes: 1368732311
    num_examples: 970344
  - name: pile_deduped_6.9b
    num_bytes: 7174034756
    num_examples: 5000000
  - name: pile_deduped_1b
    num_bytes: 7174034756
    num_examples: 5000000
  - name: pile_duped_410m
    num_bytes: 7163973430
    num_examples: 5000000
  - name: memories_deduped_70m
    num_bytes: 589429327
    num_examples: 411448
  - name: memories_deduped_2.8b
    num_bytes: 1915450723
    num_examples: 1355216
  - name: pile_duped_12b
    num_bytes: 7163973430
    num_examples: 5000000
  - name: pile_deduped_12b
    num_bytes: 7174034756
    num_examples: 5000000
  - name: memories_duped_2.8b
    num_bytes: 2346559140
    num_examples: 1675080
  - name: memories_deduped_160m
    num_bytes: 828223119
    num_examples: 581200
  - name: pile_duped_70m
    num_bytes: 7166473430
    num_examples: 5000000
  - name: pile_deduped_70m
    num_bytes: 7174034756
    num_examples: 5000000
  - name: memories_deduped_12b
    num_bytes: 2641462250
    num_examples: 1871216
  download_size: 59010780947
  dataset_size: 142503226994
configs:
- config_name: default
  data_files:
  - split: pile_duped_6.9b
    path: data/pile_duped_6.9b-*
  - split: memories_duped_1b
    path: data/memories_duped_1b-*
  - split: pile_deduped_1.4b
    path: data/pile_deduped_1.4b-*
  - split: memories_duped_6.9b
    path: data/memories_duped_6.9b-*
  - split: pile_duped_2.8b
    path: data/pile_duped_2.8b-*
  - split: memories_deduped_410m
    path: data/memories_deduped_410m-*
  - split: memories_deduped_1.4b
    path: data/memories_deduped_1.4b-*
  - split: memories_duped_12b
    path: data/memories_duped_12b-*
  - split: memories_deduped_6.9b
    path: data/memories_deduped_6.9b-*
  - split: pile_duped_160m
    path: data/pile_duped_160m-*
  - split: memories_deduped_1b
    path: data/memories_deduped_1b-*
  - split: memories_duped_160m
    path: data/memories_duped_160m-*
  - split: pile_deduped_410m
    path: data/pile_deduped_410m-*
  - split: pile_deduped_2.8b
    path: data/pile_deduped_2.8b-*
  - split: pile_deduped_160m
    path: data/pile_deduped_160m-*
  - split: memories_duped_1.4b
    path: data/memories_duped_1.4b-*
  - split: pile_duped_1.4b
    path: data/pile_duped_1.4b-*
  - split: memories_duped_70m
    path: data/memories_duped_70m-*
  - split: pile_duped_1b
    path: data/pile_duped_1b-*
  - split: memories_duped_410m
    path: data/memories_duped_410m-*
  - split: pile_deduped_6.9b
    path: data/pile_deduped_6.9b-*
  - split: pile_deduped_1b
    path: data/pile_deduped_1b-*
  - split: pile_duped_410m
    path: data/pile_duped_410m-*
  - split: memories_deduped_70m
    path: data/memories_deduped_70m-*
  - split: memories_deduped_2.8b
    path: data/memories_deduped_2.8b-*
  - split: pile_duped_12b
    path: data/pile_duped_12b-*
  - split: pile_deduped_12b
    path: data/pile_deduped_12b-*
  - split: memories_duped_2.8b
    path: data/memories_duped_2.8b-*
  - split: memories_deduped_160m
    path: data/memories_deduped_160m-*
  - split: pile_duped_70m
    path: data/pile_duped_70m-*
  - split: pile_deduped_70m
    path: data/pile_deduped_70m-*
  - split: memories_deduped_12b
    path: data/memories_deduped_12b-*
---
Provider:
usvsnsp
Summary of original information
Dataset features
- sequence_id: integer (int64)
- loss: floating point (float32)
- prompt_perplexity: floating point (float32)
- generation_perplexity: floating point (float32)
- sequence_perplexity: floating point (float32)
- text: string (string)
- is_incrementing: boolean (bool)
- is_repeating: boolean (bool)
- sequence_duplicates: integer (int64)
- max_frequency: integer (int64)
- avg_frequency: floating point (float64)
- min_frequency: integer (int64)
- median_frequency: floating point (float64)
- p25_frequency: integer (int64)
- p75_frequency: integer (int64)
- frequencies: sequence of integers (sequence: int64)
- tokens: sequence of integers (sequence: int64)
- repeating_offset: integer (int32)
- num_repeating: integer (int32)
- smallest_repeating_chunk: sequence of integers (sequence: int64)
- nl_scores: floating point (float32)
- 0_8_snowclones: integer (int64)
- 0_9_snowclones: integer (int64)
- 0_8_templates: integer (int64)
- 0_9_templates: integer (int64)
- huffman_coding_length: floating point (float64)
- memorization_score: floating point (float64)
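As a rough illustration of how these fields might be used, the sketch below streams one split and drops rows flagged by the two boolean columns. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the name shown on this page; the helper names `stream_split` and `drop_trivial` and the sample rows are hypothetical, and the interpretation of the flags is inferred from the column names only.

```python
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "usvsnsp/semantic-filters"  # dataset name as listed on this page


def stream_split(split: str):
    """Stream one split lazily instead of downloading everything (needs network)."""
    from datasets import load_dataset  # lazy import; requires `pip install datasets`

    return load_dataset(REPO_ID, split=split, streaming=True)


def drop_trivial(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield rows flagged as neither incrementing nor repeating."""
    for row in rows:
        if not row["is_incrementing"] and not row["is_repeating"]:
            yield row


# Hypothetical rows carrying only a subset of the listed features.
sample = [
    {"sequence_id": 1, "is_incrementing": False, "is_repeating": True},
    {"sequence_id": 2, "is_incrementing": False, "is_repeating": False},
    {"sequence_id": 3, "is_incrementing": True, "is_repeating": False},
]
kept_ids = [row["sequence_id"] for row in drop_trivial(sample)]
```

With network access, the same filter could be applied to real data with `drop_trivial(stream_split("memories_deduped_70m"))`.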
Dataset splits
- pile_duped_6.9b: 5000000 examples, 7163973430 bytes
- memories_duped_1b: 1256144 examples, 1765898301 bytes
- pile_deduped_1.4b: 5000000 examples, 7174034756 bytes
- memories_duped_6.9b: 2120976 examples, 2966172012 bytes
- pile_duped_2.8b: 5000000 examples, 7163973430 bytes
- memories_deduped_410m: 811040 examples, 1152062199 bytes
- memories_deduped_1.4b: 1048104 examples, 1484016295 bytes
- memories_duped_12b: 2382328 examples, 3329181971 bytes
- memories_deduped_6.9b: 1680296 examples, 2373022100 bytes
- pile_duped_160m: 5000000 examples, 7166473430 bytes
- memories_deduped_1b: 1032872 examples, 1463590448 bytes
- memories_duped_160m: 689680 examples, 978415336 bytes
- pile_deduped_410m: 5000000 examples, 7174034756 bytes
- pile_deduped_2.8b: 5000000 examples, 7174034756 bytes
- pile_deduped_160m: 5000000 examples, 7174034756 bytes
- memories_duped_1.4b: 1373728 examples, 1928658249 bytes
- pile_duped_1.4b: 5000000 examples, 7163973430 bytes
- memories_duped_70m: 463960 examples, 663287725 bytes
- pile_duped_1b: 5000000 examples, 7163973430 bytes
- memories_duped_410m: 970344 examples, 1368732311 bytes
- pile_deduped_6.9b: 5000000 examples, 7174034756 bytes
- pile_deduped_1b: 5000000 examples, 7174034756 bytes
- pile_duped_410m: 5000000 examples, 7163973430 bytes
- memories_deduped_70m: 411448 examples, 589429327 bytes
- memories_deduped_2.8b: 1355216 examples, 1915450723 bytes
- pile_duped_12b: 5000000 examples, 7163973430 bytes
- pile_deduped_12b: 5000000 examples, 7174034756 bytes
- memories_duped_2.8b: 1675080 examples, 2346559140 bytes
- memories_deduped_160m: 581200 examples, 828223119 bytes
- pile_duped_70m: 5000000 examples, 7166473430 bytes
- pile_deduped_70m: 5000000 examples, 7174034756 bytes
- memories_deduped_12b: 1871216 examples, 2641462250 bytes
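The split names appear to follow a `{pile|memories}_{duped|deduped}_{size}` pattern across eight size suffixes. The sketch below rebuilds the names from that assumed pattern and checks, using only the example counts listed above, that the `memories_*` counts grow with the size suffix. The helper names are hypothetical, and reading the suffix as a model size is an assumption the card itself does not state.

```python
# Size suffixes in ascending order, as they appear in the split names.
SIZES = ["70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b"]


def all_split_names():
    """Rebuild every split name from the assumed naming pattern."""
    return [
        f"{kind}_{scheme}_{size}"
        for kind in ("pile", "memories")
        for scheme in ("duped", "deduped")
        for size in SIZES
    ]


# num_examples of the memories_* splits, copied from the list above.
MEMORIES_EXAMPLES = {
    "deduped": {
        "70m": 411448, "160m": 581200, "410m": 811040, "1b": 1032872,
        "1.4b": 1048104, "2.8b": 1355216, "6.9b": 1680296, "12b": 1871216,
    },
    "duped": {
        "70m": 463960, "160m": 689680, "410m": 970344, "1b": 1256144,
        "1.4b": 1373728, "2.8b": 1675080, "6.9b": 2120976, "12b": 2382328,
    },
}


def grows_with_size(scheme: str) -> bool:
    """True if the example count strictly increases along SIZES."""
    counts = [MEMORIES_EXAMPLES[scheme][size] for size in SIZES]
    return all(a < b for a, b in zip(counts, counts[1:]))
```

Every `pile_*` split lists 5000000 examples, so only the `memories_*` counts vary with the size suffix.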
Dataset size
- Download size: 59010780947 bytes
- Dataset size: 142503226994 bytes
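To put the two byte counts in more readable terms, the short sketch below converts them to decimal gigabytes and computes their ratio. The variable and function names are hypothetical, and the ratio is just the quotient of the two reported numbers, not a documented compression figure.

```python
DOWNLOAD_SIZE = 59010780947   # bytes, as reported above
DATASET_SIZE = 142503226994   # bytes, as reported above


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# Fraction of the full dataset size that has to be downloaded.
ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download: {to_gb(DOWNLOAD_SIZE):.2f} GB")
print(f"dataset:  {to_gb(DATASET_SIZE):.2f} GB")
print(f"ratio:    {ratio:.3f}")
```

At these sizes, streaming a single split is likely more practical than downloading all 32.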



