usvsnsp/generation-semantic-intermediate-filters
Source: Hugging Face. Updated 2024-01-16; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/usvsnsp/generation-semantic-intermediate-filters
Resource description:
---
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: tokens
    sequence: int64
  - name: text
    dtype: string
  - name: is_incrementing
    dtype: bool
  - name: is_repeating
    dtype: bool
  - name: sequence_duplicates
    dtype: int64
  - name: max_frequency
    dtype: int64
  - name: avg_frequency
    dtype: float64
  - name: min_frequency
    dtype: int64
  - name: median_frequency
    dtype: float64
  - name: p25_frequency
    dtype: int64
  - name: p75_frequency
    dtype: int64
  - name: frequencies
    sequence: int64
  - name: nl_scores
    dtype: float32
  - name: huffman_coding_length
    dtype: float64
  - name: memorization_score
    dtype: float64
  splits:
  - name: memories_deduped_12b.43000
    num_bytes: 260439433
    num_examples: 358863
  - name: memories_deduped_12b.103000
    num_bytes: 866085351
    num_examples: 1195578
  - name: memories_deduped_12b.83000
    num_bytes: 617572913
    num_examples: 852068
  - name: memories_deduped_12b.63000
    num_bytes: 424250618
    num_examples: 585067
  - name: memories_duped_12b.123000
    num_bytes: 1430433551
    num_examples: 1996011
  - name: memories_duped_12b.103000
    num_bytes: 1082995470
    num_examples: 1510459
  - name: memories_deduped_12b.123000
    num_bytes: 1132764924
    num_examples: 1564055
  - name: memories_duped_12b.23000
    num_bytes: 142780133
    num_examples: 198175
  - name: memories_duped_12b.43000
    num_bytes: 318049401
    num_examples: 442253
  - name: memories_deduped_12b.23000
    num_bytes: 118788383
    num_examples: 163418
  - name: memories_duped_12b.63000
    num_bytes: 520315072
    num_examples: 724678
  - name: memories_duped_12b.83000
    num_bytes: 766824742
    num_examples: 1068501
  download_size: 3067878899
  dataset_size: 7681299991
configs:
- config_name: default
  data_files:
  - split: memories_deduped_12b.43000
    path: data/memories_deduped_12b.43000-*
  - split: memories_deduped_12b.103000
    path: data/memories_deduped_12b.103000-*
  - split: memories_deduped_12b.83000
    path: data/memories_deduped_12b.83000-*
  - split: memories_deduped_12b.63000
    path: data/memories_deduped_12b.63000-*
  - split: memories_duped_12b.123000
    path: data/memories_duped_12b.123000-*
  - split: memories_duped_12b.103000
    path: data/memories_duped_12b.103000-*
  - split: memories_deduped_12b.123000
    path: data/memories_deduped_12b.123000-*
  - split: memories_duped_12b.23000
    path: data/memories_duped_12b.23000-*
  - split: memories_duped_12b.43000
    path: data/memories_duped_12b.43000-*
  - split: memories_deduped_12b.23000
    path: data/memories_deduped_12b.23000-*
  - split: memories_duped_12b.63000
    path: data/memories_duped_12b.63000-*
  - split: memories_duped_12b.83000
    path: data/memories_duped_12b.83000-*
---
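Dataset information:
Features:
- Name: sequence_id; data type: 64-bit integer
- Name: tokens; type: sequence of 64-bit integers
- Name: text; data type: string
- Name: is_incrementing; data type: boolean
- Name: is_repeating; data type: boolean
- Name: sequence_duplicates; data type: 64-bit integer
- Name: max_frequency; data type: 64-bit integer
- Name: avg_frequency; data type: 64-bit float
- Name: min_frequency; data type: 64-bit integer
- Name: median_frequency; data type: 64-bit float
- Name: p25_frequency; data type: 64-bit integer
- Name: p75_frequency; data type: 64-bit integer
- Name: frequencies; type: sequence of 64-bit integers
- Name: nl_scores; data type: 32-bit float
- Name: huffman_coding_length (Huffman coding length); data type: 64-bit float
- Name: memorization_score (memorization score); data type: 64-bit float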
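The feature list above can be turned into a simple row check. The following is a minimal sketch, not part of the dataset card: the mapping from the listed dtypes to Python types (`int`, `float`, `bool`, `str`, `list`) is an assumption about how rows surface in Python, and `schema_problems` and `example_row` are hypothetical names with made-up values.

```python
# Expected Python type for each feature, following the feature list above.
# The dtype-to-Python-type mapping is an assumption, not stated by the card.
SCHEMA = {
    "sequence_id": int,
    "tokens": list,  # sequence of int64
    "text": str,
    "is_incrementing": bool,
    "is_repeating": bool,
    "sequence_duplicates": int,
    "max_frequency": int,
    "avg_frequency": float,
    "min_frequency": int,
    "median_frequency": float,
    "p25_frequency": int,
    "p75_frequency": int,
    "frequencies": list,  # sequence of int64
    "nl_scores": float,
    "huffman_coding_length": float,
    "memorization_score": float,
}
SEQUENCE_FIELDS = {"tokens", "frequencies"}


def schema_problems(row):
    """Return a list of readable mismatches between `row` and SCHEMA."""
    problems = []
    for name, expected in SCHEMA.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        if expected is int and isinstance(value, bool):
            problems.append(f"{name}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif name in SEQUENCE_FIELDS and not all(
            isinstance(v, int) and not isinstance(v, bool) for v in value
        ):
            problems.append(f"{name}: expected a list of ints")
    for name in row:
        if name not in SCHEMA:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical row with made-up values, used only to exercise the check.
example_row = {
    "sequence_id": 0, "tokens": [1, 2, 3], "text": "abc",
    "is_incrementing": False, "is_repeating": False,
    "sequence_duplicates": 1, "max_frequency": 3, "avg_frequency": 2.0,
    "min_frequency": 1, "median_frequency": 2.0, "p25_frequency": 1,
    "p75_frequency": 3, "frequencies": [1, 2, 3], "nl_scores": 0.5,
    "huffman_coding_length": 1.5, "memorization_score": 1.0,
}
print(schema_problems(example_row))
```

A check like this is mainly useful for catching renamed or missing columns early when rows are exported to another format.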
Splits:
- Name: memories_deduped_12b.43000; bytes: 260439433; examples: 358863
- Name: memories_deduped_12b.103000; bytes: 866085351; examples: 1195578
- Name: memories_deduped_12b.83000; bytes: 617572913; examples: 852068
- Name: memories_deduped_12b.63000; bytes: 424250618; examples: 585067
- Name: memories_duped_12b.123000; bytes: 1430433551; examples: 1996011
- Name: memories_duped_12b.103000; bytes: 1082995470; examples: 1510459
- Name: memories_deduped_12b.123000; bytes: 1132764924; examples: 1564055
- Name: memories_duped_12b.23000; bytes: 142780133; examples: 198175
- Name: memories_duped_12b.43000; bytes: 318049401; examples: 442253
- Name: memories_deduped_12b.23000; bytes: 118788383; examples: 163418
- Name: memories_duped_12b.63000; bytes: 520315072; examples: 724678
- Name: memories_duped_12b.83000; bytes: 766824742; examples: 1068501
Total download size: 3067878899 bytes
Total dataset size: 7681299991 bytes
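The size figures above are internally consistent: the per-split byte counts add up to the stated total dataset size. A short sketch of that arithmetic follows, using only the numbers listed above. Reading the download size as the size of the stored (presumably compressed) files is an assumption; the card does not say so.

```python
# (split name, num_bytes, num_examples), copied from the split list above.
SPLITS = [
    ("memories_deduped_12b.43000", 260439433, 358863),
    ("memories_deduped_12b.103000", 866085351, 1195578),
    ("memories_deduped_12b.83000", 617572913, 852068),
    ("memories_deduped_12b.63000", 424250618, 585067),
    ("memories_duped_12b.123000", 1430433551, 1996011),
    ("memories_duped_12b.103000", 1082995470, 1510459),
    ("memories_deduped_12b.123000", 1132764924, 1564055),
    ("memories_duped_12b.23000", 142780133, 198175),
    ("memories_duped_12b.43000", 318049401, 442253),
    ("memories_deduped_12b.23000", 118788383, 163418),
    ("memories_duped_12b.63000", 520315072, 724678),
    ("memories_duped_12b.83000", 766824742, 1068501),
]
DOWNLOAD_SIZE = 3067878899
DATASET_SIZE = 7681299991

total_bytes = sum(num_bytes for _, num_bytes, _ in SPLITS)
total_examples = sum(num_examples for _, _, num_examples in SPLITS)

# The per-split sizes should add up to the reported dataset size.
assert total_bytes == DATASET_SIZE

print(f"total examples: {total_examples}")
print(f"download size / dataset size: {DOWNLOAD_SIZE / DATASET_SIZE:.3f}")
for name, num_bytes, num_examples in SPLITS:
    print(f"{name}: {num_bytes / num_examples:.1f} bytes per example")
```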
Dataset configuration:
- Config name: default
Data files:
- Split: memories_deduped_12b.43000; path: data/memories_deduped_12b.43000-*
- Split: memories_deduped_12b.103000; path: data/memories_deduped_12b.103000-*
- Split: memories_deduped_12b.83000; path: data/memories_deduped_12b.83000-*
- Split: memories_deduped_12b.63000; path: data/memories_deduped_12b.63000-*
- Split: memories_duped_12b.123000; path: data/memories_duped_12b.123000-*
- Split: memories_duped_12b.103000; path: data/memories_duped_12b.103000-*
- Split: memories_deduped_12b.123000; path: data/memories_deduped_12b.123000-*
- Split: memories_duped_12b.23000; path: data/memories_duped_12b.23000-*
- Split: memories_duped_12b.43000; path: data/memories_duped_12b.43000-*
- Split: memories_deduped_12b.23000; path: data/memories_deduped_12b.23000-*
- Split: memories_duped_12b.63000; path: data/memories_duped_12b.63000-*
- Split: memories_duped_12b.83000; path: data/memories_duped_12b.83000-*
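Each split maps to a glob of the form `data/<split>-*`, and the split names follow a regular pattern. The sketch below parses the names and matches a file path against the globs. Reading the name parts as a duped/deduped variant, a model size, and a numeric step is an interpretation of the names, not something the card defines; `parse_split`, `data_file_pattern`, and the shard file name are hypothetical.

```python
import fnmatch
import re
from collections import defaultdict

# Split names copied from the configuration above.
SPLIT_NAMES = [
    "memories_deduped_12b.43000", "memories_deduped_12b.103000",
    "memories_deduped_12b.83000", "memories_deduped_12b.63000",
    "memories_duped_12b.123000", "memories_duped_12b.103000",
    "memories_deduped_12b.123000", "memories_duped_12b.23000",
    "memories_duped_12b.43000", "memories_deduped_12b.23000",
    "memories_duped_12b.63000", "memories_duped_12b.83000",
]

# Assumed naming scheme: memories_<variant>_<size>.<step>.
SPLIT_RE = re.compile(r"^memories_(deduped|duped)_(\w+)\.(\d+)$")


def parse_split(name):
    """Split a name into (variant, size, step); raise if it does not fit."""
    match = SPLIT_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized split name: {name}")
    variant, size, step = match.groups()
    return variant, size, int(step)


def data_file_pattern(name):
    """Glob pattern for a split, in the form used by the config above."""
    return f"data/{name}-*"


# Group the numeric steps by variant and sort them.
steps_by_variant = defaultdict(list)
for name in SPLIT_NAMES:
    variant, _, step = parse_split(name)
    steps_by_variant[variant].append(step)
steps_by_variant = {v: sorted(s) for v, s in steps_by_variant.items()}
print(steps_by_variant)

# Hypothetical shard file name, used only to show the glob matching.
shard = "data/memories_duped_12b.23000-00000-of-00001.parquet"
matches = [
    name for name in SPLIT_NAMES
    if fnmatch.fnmatchcase(shard, data_file_pattern(name))
]
print(matches)
```

Because every pattern ends in `-*` right after the full split name, a shard for step 23000 does not also match the pattern for step 123000, and `duped` does not match `deduped`.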
Provider:
usvsnsp
Original information summary
Dataset overview
Dataset features
- sequence_id: int64
- tokens: sequence of int64
- text: string
- is_incrementing: bool
- is_repeating: bool
- sequence_duplicates: int64
- max_frequency: int64
- avg_frequency: float64
- min_frequency: int64
- median_frequency: float64
- p25_frequency: int64
- p75_frequency: int64
- frequencies: sequence of int64
- nl_scores: float32
- huffman_coding_length: float64
- memorization_score: float64
Dataset splits
- memories_deduped_12b.43000: 260439433 bytes, 358863 examples
- memories_deduped_12b.103000: 866085351 bytes, 1195578 examples
- memories_deduped_12b.83000: 617572913 bytes, 852068 examples
- memories_deduped_12b.63000: 424250618 bytes, 585067 examples
- memories_duped_12b.123000: 1430433551 bytes, 1996011 examples
- memories_duped_12b.103000: 1082995470 bytes, 1510459 examples
- memories_deduped_12b.123000: 1132764924 bytes, 1564055 examples
- memories_duped_12b.23000: 142780133 bytes, 198175 examples
- memories_duped_12b.43000: 318049401 bytes, 442253 examples
- memories_deduped_12b.23000: 118788383 bytes, 163418 examples
- memories_duped_12b.63000: 520315072 bytes, 724678 examples
- memories_duped_12b.83000: 766824742 bytes, 1068501 examples
Dataset size
- Download size: 3067878899 bytes
- Dataset size: 7681299991 bytes
Configuration
- Config name: default
- Data files:
- split: memories_deduped_12b.43000, path: data/memories_deduped_12b.43000-*
- split: memories_deduped_12b.103000, path: data/memories_deduped_12b.103000-*
- split: memories_deduped_12b.83000, path: data/memories_deduped_12b.83000-*
- split: memories_deduped_12b.63000, path: data/memories_deduped_12b.63000-*
- split: memories_duped_12b.123000, path: data/memories_duped_12b.123000-*
- split: memories_duped_12b.103000, path: data/memories_duped_12b.103000-*
- split: memories_deduped_12b.123000, path: data/memories_deduped_12b.123000-*
- split: memories_duped_12b.23000, path: data/memories_duped_12b.23000-*
- split: memories_duped_12b.43000, path: data/memories_duped_12b.43000-*
- split: memories_deduped_12b.23000, path: data/memories_deduped_12b.23000-*
- split: memories_duped_12b.63000, path: data/memories_duped_12b.63000-*
- split: memories_duped_12b.83000, path: data/memories_duped_12b.83000-*



