usvsnsp/generation-semantic-filters
Hugging Face · Updated 2024-02-26 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/usvsnsp/generation-semantic-filters
Resource description:
---
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: tokens
    sequence: int64
  - name: text
    dtype: string
  - name: is_incrementing
    dtype: bool
  - name: is_repeating
    dtype: bool
  - name: sequence_duplicates
    dtype: int64
  - name: max_frequency
    dtype: int64
  - name: avg_frequency
    dtype: float64
  - name: min_frequency
    dtype: int64
  - name: median_frequency
    dtype: float64
  - name: p25_frequency
    dtype: int64
  - name: p75_frequency
    dtype: int64
  - name: frequencies
    sequence: int64
  - name: nl_scores
    dtype: float32
  - name: 0_8_snowclones
    dtype: int64
  - name: 0_9_snowclones
    dtype: int64
  - name: 0_8_templates
    dtype: int64
  - name: 0_9_templates
    dtype: int64
  - name: huffman_coding_length
    dtype: float64
  - name: memorization_score
    dtype: float64
  - name: index
    dtype: int64
  - name: loss
    dtype: float32
  - name: prompt_perplexity
    dtype: float32
  - name: generation_perplexity
    dtype: float32
  - name: sequence_perplexity
    dtype: float32
  splits:
  - name: pile_duped_6.9b
    num_bytes: 3952510656
    num_examples: 5000000
  - name: memories_duped_1b
    num_bytes: 975471445
    num_examples: 1256144
  - name: pile_deduped_1.4b
    num_bytes: 3957538226
    num_examples: 5000000
  - name: memories_duped_6.9b
    num_bytes: 1639665283
    num_examples: 2120976
  - name: pile_duped_2.8b
    num_bytes: 3952510656
    num_examples: 5000000
  - name: memories_deduped_410m
    num_bytes: 636147799
    num_examples: 811040
  - name: memories_deduped_1.4b
    num_bytes: 819760730
    num_examples: 1048104
  - name: memories_duped_12b
    num_bytes: 1840563478
    num_examples: 2382328
  - name: memories_deduped_6.9b
    num_bytes: 1311517858
    num_examples: 1680296
  - name: pile_duped_160m
    num_bytes: 3955010656
    num_examples: 5000000
  - name: memories_deduped_1b
    num_bytes: 808469787
    num_examples: 1032872
  - name: memories_duped_160m
    num_bytes: 540094560
    num_examples: 689680
  - name: pile_deduped_410m
    num_bytes: 3957538226
    num_examples: 5000000
  - name: pile_deduped_2.8b
    num_bytes: 3957538226
    num_examples: 5000000
  - name: pile_deduped_160m
    num_bytes: 3957538226
    num_examples: 5000000
  - name: memories_duped_1.4b
    num_bytes: 1065534772
    num_examples: 1373728
  - name: pile_duped_1.4b
    num_bytes: 3952510656
    num_examples: 5000000
  - name: memories_duped_70m
    num_bytes: 365918006
    num_examples: 463960
  - name: pile_duped_1b
    num_bytes: 3952510656
    num_examples: 5000000
  - name: memories_duped_410m
    num_bytes: 755826030
    num_examples: 970344
  - name: pile_deduped_6.9b
    num_bytes: 3957538226
    num_examples: 5000000
  - name: pile_deduped_1b
    num_bytes: 3957538226
    num_examples: 5000000
  - name: pile_duped_410m
    num_bytes: 3952510656
    num_examples: 5000000
  - name: memories_deduped_70m
    num_bytes: 325241847
    num_examples: 411448
  - name: memories_deduped_2.8b
    num_bytes: 1058394856
    num_examples: 1355216
  - name: pile_duped_12b
    num_bytes: 3952510656
    num_examples: 5000000
  - name: pile_deduped_12b
    num_bytes: 3957538226
    num_examples: 5000000
  - name: memories_duped_2.8b
    num_bytes: 1296804964
    num_examples: 1675080
  - name: memories_deduped_160m
    num_bytes: 457088933
    num_examples: 581200
  - name: pile_duped_70m
    num_bytes: 3955010656
    num_examples: 5000000
  - name: pile_deduped_70m
    num_bytes: 3957538226
    num_examples: 5000000
  - name: memories_deduped_12b
    num_bytes: 1460036167
    num_examples: 1871216
  download_size: 36995035353
  dataset_size: 80101963738
configs:
- config_name: default
  data_files:
  - split: pile_duped_6.9b
    path: data/pile_duped_6.9b-*
  - split: memories_duped_1b
    path: data/memories_duped_1b-*
  - split: pile_deduped_1.4b
    path: data/pile_deduped_1.4b-*
  - split: memories_duped_6.9b
    path: data/memories_duped_6.9b-*
  - split: pile_duped_2.8b
    path: data/pile_duped_2.8b-*
  - split: memories_deduped_410m
    path: data/memories_deduped_410m-*
  - split: memories_deduped_1.4b
    path: data/memories_deduped_1.4b-*
  - split: memories_duped_12b
    path: data/memories_duped_12b-*
  - split: memories_deduped_6.9b
    path: data/memories_deduped_6.9b-*
  - split: pile_duped_160m
    path: data/pile_duped_160m-*
  - split: memories_deduped_1b
    path: data/memories_deduped_1b-*
  - split: memories_duped_160m
    path: data/memories_duped_160m-*
  - split: pile_deduped_410m
    path: data/pile_deduped_410m-*
  - split: pile_deduped_2.8b
    path: data/pile_deduped_2.8b-*
  - split: pile_deduped_160m
    path: data/pile_deduped_160m-*
  - split: memories_duped_1.4b
    path: data/memories_duped_1.4b-*
  - split: pile_duped_1.4b
    path: data/pile_duped_1.4b-*
  - split: memories_duped_70m
    path: data/memories_duped_70m-*
  - split: pile_duped_1b
    path: data/pile_duped_1b-*
  - split: memories_duped_410m
    path: data/memories_duped_410m-*
  - split: pile_deduped_6.9b
    path: data/pile_deduped_6.9b-*
  - split: pile_deduped_1b
    path: data/pile_deduped_1b-*
  - split: pile_duped_410m
    path: data/pile_duped_410m-*
  - split: memories_deduped_70m
    path: data/memories_deduped_70m-*
  - split: memories_deduped_2.8b
    path: data/memories_deduped_2.8b-*
  - split: pile_duped_12b
    path: data/pile_duped_12b-*
  - split: pile_deduped_12b
    path: data/pile_deduped_12b-*
  - split: memories_duped_2.8b
    path: data/memories_duped_2.8b-*
  - split: memories_deduped_160m
    path: data/memories_deduped_160m-*
  - split: pile_duped_70m
    path: data/pile_duped_70m-*
  - split: pile_deduped_70m
    path: data/pile_deduped_70m-*
  - split: memories_deduped_12b
    path: data/memories_deduped_12b-*
---
Provider:
usvsnsp
Summary of original information
Dataset features
The dataset contains the following features:
- sequence_id: int64
- tokens: sequence of int64
- text: string
- is_incrementing: bool
- is_repeating: bool
- sequence_duplicates: int64
- max_frequency: int64
- avg_frequency: float64
- min_frequency: int64
- median_frequency: float64
- p25_frequency: int64
- p75_frequency: int64
- frequencies: sequence of int64
- nl_scores: float32
- 0_8_snowclones: int64
- 0_9_snowclones: int64
- 0_8_templates: int64
- 0_9_templates: int64
- huffman_coding_length: float64
- memorization_score: float64
- index: int64
- loss: float32
- prompt_perplexity: float32
- generation_perplexity: float32
- sequence_perplexity: float32
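The feature list above can be turned into a small row check. The following is a minimal sketch, not part of the dataset's own tooling: the `FEATURES` mapping copies the column names and dtypes listed above, while the kind labels (`"int"`, `"int_seq"`, and so on) and the helper `check_row` are hypothetical names introduced here for illustration.

```python
from typing import Any

# Column name -> coarse kind, copied from the feature list above.
# int64 -> "int", float32/float64 -> "float", sequence of int64 -> "int_seq".
FEATURES: dict[str, str] = {
    "sequence_id": "int", "tokens": "int_seq", "text": "str",
    "is_incrementing": "bool", "is_repeating": "bool",
    "sequence_duplicates": "int", "max_frequency": "int",
    "avg_frequency": "float", "min_frequency": "int",
    "median_frequency": "float", "p25_frequency": "int",
    "p75_frequency": "int", "frequencies": "int_seq",
    "nl_scores": "float", "0_8_snowclones": "int",
    "0_9_snowclones": "int", "0_8_templates": "int",
    "0_9_templates": "int", "huffman_coding_length": "float",
    "memorization_score": "float", "index": "int", "loss": "float",
    "prompt_perplexity": "float", "generation_perplexity": "float",
    "sequence_perplexity": "float",
}


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _matches(value: Any, kind: str) -> bool:
    if kind == "int":
        return _is_int(value)
    if kind == "float":
        return isinstance(value, float) or _is_int(value)
    if kind == "bool":
        return isinstance(value, bool)
    if kind == "str":
        return isinstance(value, str)
    if kind == "int_seq":
        return isinstance(value, (list, tuple)) and all(_is_int(x) for x in value)
    raise ValueError(f"unknown kind: {kind}")


def check_row(row: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row fits."""
    problems = [f"missing column: {name}" for name in FEATURES if name not in row]
    problems += [f"unexpected column: {name}" for name in row if name not in FEATURES]
    problems += [
        f"bad type for {name}"
        for name, kind in FEATURES.items()
        if name in row and not _matches(row[name], kind)
    ]
    return problems
```

A check like this is mainly useful when reading rows through a path that does not enforce the declared schema, for example after exporting a split to JSON.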
Dataset splits
The dataset contains the following splits:
- pile_duped_6.9b: 3952510656 bytes, 5000000 examples
- memories_duped_1b: 975471445 bytes, 1256144 examples
- pile_deduped_1.4b: 3957538226 bytes, 5000000 examples
- memories_duped_6.9b: 1639665283 bytes, 2120976 examples
- pile_duped_2.8b: 3952510656 bytes, 5000000 examples
- memories_deduped_410m: 636147799 bytes, 811040 examples
- memories_deduped_1.4b: 819760730 bytes, 1048104 examples
- memories_duped_12b: 1840563478 bytes, 2382328 examples
- memories_deduped_6.9b: 1311517858 bytes, 1680296 examples
- pile_duped_160m: 3955010656 bytes, 5000000 examples
- memories_deduped_1b: 808469787 bytes, 1032872 examples
- memories_duped_160m: 540094560 bytes, 689680 examples
- pile_deduped_410m: 3957538226 bytes, 5000000 examples
- pile_deduped_2.8b: 3957538226 bytes, 5000000 examples
- pile_deduped_160m: 3957538226 bytes, 5000000 examples
- memories_duped_1.4b: 1065534772 bytes, 1373728 examples
- pile_duped_1.4b: 3952510656 bytes, 5000000 examples
- memories_duped_70m: 365918006 bytes, 463960 examples
- pile_duped_1b: 3952510656 bytes, 5000000 examples
- memories_duped_410m: 755826030 bytes, 970344 examples
- pile_deduped_6.9b: 3957538226 bytes, 5000000 examples
- pile_deduped_1b: 3957538226 bytes, 5000000 examples
- pile_duped_410m: 3952510656 bytes, 5000000 examples
- memories_deduped_70m: 325241847 bytes, 411448 examples
- memories_deduped_2.8b: 1058394856 bytes, 1355216 examples
- pile_duped_12b: 3952510656 bytes, 5000000 examples
- pile_deduped_12b: 3957538226 bytes, 5000000 examples
- memories_duped_2.8b: 1296804964 bytes, 1675080 examples
- memories_deduped_160m: 457088933 bytes, 581200 examples
- pile_duped_70m: 3955010656 bytes, 5000000 examples
- pile_deduped_70m: 3957538226 bytes, 5000000 examples
- memories_deduped_12b: 1460036167 bytes, 1871216 examples
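Every split name above follows the same three-part pattern joined by underscores. The sketch below splits a name into those parts and regenerates the full list. The labels `SOURCES`, `VARIANTS`, and `SIZES`, and the reading of the last part as a model size, are assumptions made here for illustration; the source only lists the names.

```python
from itertools import product

# The three components observed in the split names above.
SOURCES = ("pile", "memories")
VARIANTS = ("duped", "deduped")
SIZES = ("70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b")


def parse_split(name: str) -> tuple[str, str, str]:
    """Split e.g. 'memories_deduped_1.4b' into ('memories', 'deduped', '1.4b')."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three underscore-separated parts: {name!r}")
    source, variant, size = parts
    if source not in SOURCES or variant not in VARIANTS or size not in SIZES:
        raise ValueError(f"unrecognised split name: {name!r}")
    return source, variant, size


def all_split_names() -> list[str]:
    """Enumerate every source/variant/size combination (2 x 2 x 8 names)."""
    return [f"{s}_{v}_{z}" for s, v, z in product(SOURCES, VARIANTS, SIZES)]
```

Generating the names this way yields 32 combinations, which matches the number of splits listed, so each combination corresponds to one split.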
Dataset size
- Download size: 36995035353 bytes
- Dataset size: 80101963738 bytes
Configuration
- Config name: default
- Data files:
  - pile_duped_6.9b: data/pile_duped_6.9b-*
  - memories_duped_1b: data/memories_duped_1b-*
  - pile_deduped_1.4b: data/pile_deduped_1.4b-*
  - memories_duped_6.9b: data/memories_duped_6.9b-*
  - pile_duped_2.8b: data/pile_duped_2.8b-*
  - memories_deduped_410m: data/memories_deduped_410m-*
  - memories_deduped_1.4b: data/memories_deduped_1.4b-*
  - memories_duped_12b: data/memories_duped_12b-*
  - memories_deduped_6.9b: data/memories_deduped_6.9b-*
  - pile_duped_160m: data/pile_duped_160m-*
  - memories_deduped_1b: data/memories_deduped_1b-*
  - memories_duped_160m: data/memories_duped_160m-*
  - pile_deduped_410m: data/pile_deduped_410m-*
  - pile_deduped_2.8b: data/pile_deduped_2.8b-*
  - pile_deduped_160m: data/pile_deduped_160m-*
  - memories_duped_1.4b: data/memories_duped_1.4b-*
  - pile_duped_1.4b: data/pile_duped_1.4b-*
  - memories_duped_70m: data/memories_duped_70m-*
  - pile_duped_1b: data/pile_duped_1b-*
  - memories_duped_410m: data/memories_duped_410m-*
  - pile_deduped_6.9b: data/pile_deduped_6.9b-*
  - pile_deduped_1b: data/pile_deduped_1b-*
  - pile_duped_410m: data/pile_duped_410m-*
  - memories_deduped_70m: data/memories_deduped_70m-*
  - memories_deduped_2.8b: data/memories_deduped_2.8b-*
  - pile_duped_12b: data/pile_duped_12b-*
  - pile_deduped_12b: data/pile_deduped_12b-*
  - memories_duped_2.8b: data/memories_duped_2.8b-*
  - memories_deduped_160m: data/memories_deduped_160m-*
  - pile_duped_70m: data/pile_duped_70m-*
  - pile_deduped_70m: data/pile_deduped_70m-*
  - memories_deduped_12b: data/memories_deduped_12b-*
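Since the default configuration maps each split to a `data/<split>-*` file pattern, a single split can be requested by name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper names `data_glob` and `load_split` are hypothetical and introduced only for this example.

```python
# Repository id as given in the download link above.
REPO_ID = "usvsnsp/generation-semantic-filters"


def data_glob(split: str) -> str:
    """Return the file pattern the default config associates with `split`."""
    return f"data/{split}-*"


def load_split(split: str, streaming: bool = True):
    """Load one split by name.

    streaming=True iterates over rows lazily instead of downloading the
    whole split first, which matters given the sizes listed above.
    """
    # Imported lazily because `datasets` is a third-party dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

For example, `rows = load_split("memories_deduped_70m")` followed by `next(iter(rows))` would fetch a single row as a dictionary keyed by the feature names. Starting with one of the smaller `memories_*` splits keeps a first experiment cheap.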



