usvsnsp/semantic-filters-intermediate
Source: Hugging Face. Updated 2024-06-05; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/usvsnsp/semantic-filters-intermediate
Resource description:
---
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: text
    dtype: string
  - name: is_incrementing
    dtype: bool
  - name: is_repeating
    dtype: bool
  - name: sequence_duplicates
    dtype: int64
  - name: max_frequency
    dtype: int64
  - name: avg_frequency
    dtype: float64
  - name: min_frequency
    dtype: int64
  - name: median_frequency
    dtype: float64
  - name: p25_frequency
    dtype: int64
  - name: p75_frequency
    dtype: int64
  - name: frequencies
    sequence: int64
  - name: tokens
    sequence: int64
  - name: repeating_offset
    dtype: int32
  - name: num_repeating
    dtype: int32
  - name: smallest_repeating_chunk
    sequence: int64
  - name: nl_scores
    dtype: float32
  - name: huffman_coding_length
    dtype: float64
  - name: memorization_score
    dtype: float64
  splits:
  - name: memories_duped_12b.103000
    num_bytes: 2039692286
    num_examples: 1510459
  - name: memories_duped_12b.123000
    num_bytes: 2693829786
    num_examples: 1996011
  - name: memories_deduped_12b.103000
    num_bytes: 1630873156
    num_examples: 1195578
  - name: memories_duped_12b.63000
    num_bytes: 980134587
    num_examples: 724678
  - name: memories_deduped_12b.63000
    num_bytes: 799040084
    num_examples: 585067
  - name: memories_deduped_12b.23000
    num_bytes: 223774463
    num_examples: 163418
  - name: memories_deduped_12b.123000
    num_bytes: 2132851472
    num_examples: 1564055
  - name: memories_duped_12b.83000
    num_bytes: 1444337301
    num_examples: 1068501
  - name: memories_duped_12b.23000
    num_bytes: 269043600
    num_examples: 198175
  - name: memories_deduped_12b.83000
    num_bytes: 1163105937
    num_examples: 852068
  - name: memories_duped_12b.43000
    num_bytes: 599186210
    num_examples: 442253
  - name: memories_deduped_12b.43000
    num_bytes: 490567091
    num_examples: 358863
  - name: memories_deduped_12b.143000
    num_bytes: 2551643882
    num_examples: 1871216
  - name: memories_duped_12b.143000
    num_bytes: 3214830227
    num_examples: 2382328
  download_size: 7263364113
  dataset_size: 20232910082
configs:
- config_name: default
  data_files:
  - split: memories_duped_12b.103000
    path: data/memories_duped_12b.103000-*
  - split: memories_duped_12b.123000
    path: data/memories_duped_12b.123000-*
  - split: memories_deduped_12b.103000
    path: data/memories_deduped_12b.103000-*
  - split: memories_duped_12b.63000
    path: data/memories_duped_12b.63000-*
  - split: memories_deduped_12b.63000
    path: data/memories_deduped_12b.63000-*
  - split: memories_deduped_12b.23000
    path: data/memories_deduped_12b.23000-*
  - split: memories_deduped_12b.123000
    path: data/memories_deduped_12b.123000-*
  - split: memories_duped_12b.83000
    path: data/memories_duped_12b.83000-*
  - split: memories_duped_12b.23000
    path: data/memories_duped_12b.23000-*
  - split: memories_deduped_12b.83000
    path: data/memories_deduped_12b.83000-*
  - split: memories_duped_12b.43000
    path: data/memories_duped_12b.43000-*
  - split: memories_deduped_12b.43000
    path: data/memories_deduped_12b.43000-*
  - split: memories_deduped_12b.143000
    path: data/memories_deduped_12b.143000-*
  - split: memories_duped_12b.143000
    path: data/memories_duped_12b.143000-*
---
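A minimal sketch of how one split from the metadata above might be loaded, assuming the third-party `datasets` library is installed and the repository is reachable on the Hugging Face Hub. The helper names `split_name` and `stream_split` are hypothetical and not part of the dataset itself. Streaming is used because the full dataset is roughly 20 GB.

```python
REPO_ID = "usvsnsp/semantic-filters-intermediate"


def split_name(variant: str, step: int, model: str = "12b") -> str:
    """Build a split name such as 'memories_duped_12b.143000'."""
    if variant not in ("duped", "deduped"):
        raise ValueError(f"unknown variant: {variant!r}")
    return f"memories_{variant}_{model}.{step}"


def stream_split(variant: str, step: int):
    """Return an iterable over one split without downloading all files first.

    Needs network access; `datasets` is imported lazily so that
    split_name() stays usable without it.
    """
    from datasets import load_dataset  # third-party dependency

    return load_dataset(REPO_ID, split=split_name(variant, step), streaming=True)


# Example use (requires network access):
#   row = next(iter(stream_split("deduped", 23000)))
#   print(row["sequence_id"], row["memorization_score"])
```

Only the split names listed in the metadata exist, so a call such as `stream_split("duped", 5000)` would be expected to fail at load time.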
Provider:
usvsnsp
Summary of original information
Dataset overview
Dataset features
- sequence_id: integer (int64)
- text: string (string)
- is_incrementing: boolean (bool)
- is_repeating: boolean (bool)
- sequence_duplicates: integer (int64)
- max_frequency: integer (int64)
- avg_frequency: floating point (float64)
- min_frequency: integer (int64)
- median_frequency: floating point (float64)
- p25_frequency: integer (int64)
- p75_frequency: integer (int64)
- frequencies: sequence (sequence: int64)
- tokens: sequence (sequence: int64)
- repeating_offset: integer (int32)
- num_repeating: integer (int32)
- smallest_repeating_chunk: sequence (sequence: int64)
- nl_scores: floating point (float32)
- huffman_coding_length: floating point (float64)
- memorization_score: floating point (float64)
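The feature list above can be mirrored as a lightweight schema check on a single row. This is a sketch only: the Python types are an assumed mapping from the listed dtypes (int64 and int32 to `int`, float64 and float32 to `float`, sequences to `list` of `int`), and `EXPECTED_TYPES` and `validate_row` are hypothetical names, not part of any tooling shipped with the dataset.

```python
# Field name -> Python type assumed to correspond to the listed dtype.
EXPECTED_TYPES = {
    "sequence_id": int,
    "text": str,
    "is_incrementing": bool,
    "is_repeating": bool,
    "sequence_duplicates": int,
    "max_frequency": int,
    "avg_frequency": float,
    "min_frequency": int,
    "median_frequency": float,
    "p25_frequency": int,
    "p75_frequency": int,
    "frequencies": list,
    "tokens": list,
    "repeating_offset": int,
    "num_repeating": int,
    "smallest_repeating_chunk": list,
    "nl_scores": float,
    "huffman_coding_length": float,
    "memorization_score": float,
}


def _matches(value, expected) -> bool:
    if expected is bool:
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool subclasses int; do not accept it as a number
    if expected is float:
        return isinstance(value, (int, float))
    if expected is list:
        return isinstance(value, list) and all(
            isinstance(v, int) and not isinstance(v, bool) for v in value
        )
    return isinstance(value, expected)


def validate_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not _matches(row[name], expected):
            got = type(row[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    return problems
```

A check like this is mainly useful after exporting rows to another format such as JSON or CSV, where booleans and integers are easy to confuse.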
Dataset splits
- memories_duped_12b.103000: 2039692286 bytes, 1510459 examples
- memories_duped_12b.123000: 2693829786 bytes, 1996011 examples
- memories_deduped_12b.103000: 1630873156 bytes, 1195578 examples
- memories_duped_12b.63000: 980134587 bytes, 724678 examples
- memories_deduped_12b.63000: 799040084 bytes, 585067 examples
- memories_deduped_12b.23000: 223774463 bytes, 163418 examples
- memories_deduped_12b.123000: 2132851472 bytes, 1564055 examples
- memories_duped_12b.83000: 1444337301 bytes, 1068501 examples
- memories_duped_12b.23000: 269043600 bytes, 198175 examples
- memories_deduped_12b.83000: 1163105937 bytes, 852068 examples
- memories_duped_12b.43000: 599186210 bytes, 442253 examples
- memories_deduped_12b.43000: 490567091 bytes, 358863 examples
- memories_deduped_12b.143000: 2551643882 bytes, 1871216 examples
- memories_duped_12b.143000: 3214830227 bytes, 2382328 examples
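The split names above appear to follow the pattern `memories_<variant>_<model>.<number>`. Reading the numeric suffix as a training checkpoint step is an assumption, since the card does not say so. Under that assumption, a small sketch can pair the duped and deduped example counts by suffix; `SPLIT_EXAMPLES`, `parse_split`, and `counts_by_step` are illustrative names, and the counts are copied from the list above.

```python
import re

# Example counts copied from the split list above.
SPLIT_EXAMPLES = {
    "memories_duped_12b.103000": 1510459,
    "memories_duped_12b.123000": 1996011,
    "memories_deduped_12b.103000": 1195578,
    "memories_duped_12b.63000": 724678,
    "memories_deduped_12b.63000": 585067,
    "memories_deduped_12b.23000": 163418,
    "memories_deduped_12b.123000": 1564055,
    "memories_duped_12b.83000": 1068501,
    "memories_duped_12b.23000": 198175,
    "memories_deduped_12b.83000": 852068,
    "memories_duped_12b.43000": 442253,
    "memories_deduped_12b.43000": 358863,
    "memories_deduped_12b.143000": 1871216,
    "memories_duped_12b.143000": 2382328,
}

_PATTERN = re.compile(r"^memories_(duped|deduped)_(\w+)\.(\d+)$")


def parse_split(name: str):
    """Split a name into (variant, model, step); step is the numeric suffix."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected split name: {name!r}")
    variant, model, step = match.groups()
    return variant, model, int(step)


def counts_by_step(examples: dict) -> dict:
    """Group example counts as {step: {variant: count}}, sorted by step."""
    table = {}
    for name, count in examples.items():
        variant, _model, step = parse_split(name)
        table.setdefault(step, {})[variant] = count
    return dict(sorted(table.items()))


for step, row in counts_by_step(SPLIT_EXAMPLES).items():
    print(f"{step:>7}  duped={row['duped']:>8}  deduped={row['deduped']:>8}")
```

Sorting by the numeric suffix rather than by the raw name avoids the lexicographic ordering in which `103000` would sort before `23000`.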
Dataset size
- Download size: 7263364113 bytes
- Dataset size: 20232910082 bytes
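As a quick arithmetic check on the figures above, the per-split byte counts can be summed and compared with the reported dataset size. The sketch below copies the numbers from the split list; `SPLIT_BYTES` and the other variable names are illustrative.

```python
# Byte counts copied from the split list, in the order they are listed.
SPLIT_BYTES = [
    2039692286, 2693829786, 1630873156, 980134587, 799040084,
    223774463, 2132851472, 1444337301, 269043600, 1163105937,
    599186210, 490567091, 2551643882, 3214830227,
]
DATASET_SIZE = 20232910082   # reported dataset size, bytes
DOWNLOAD_SIZE = 7263364113   # reported download size, bytes

total = sum(SPLIT_BYTES)
print(total == DATASET_SIZE)  # True: the splits account for the full size

# Ratio of the download size to the reported dataset size.
ratio = DOWNLOAD_SIZE / DATASET_SIZE
print(f"download size is {ratio:.1%} of the dataset size")
```

The split sizes add up exactly to the reported dataset size, and the download is a little over a third of that figure.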



