
usvsnsp/generation-semantic-intermediate-filters

Source: Hugging Face | Updated: 2024-01-16 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/usvsnsp/generation-semantic-intermediate-filters
Resource description:
---
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: tokens
    sequence: int64
  - name: text
    dtype: string
  - name: is_incrementing
    dtype: bool
  - name: is_repeating
    dtype: bool
  - name: sequence_duplicates
    dtype: int64
  - name: max_frequency
    dtype: int64
  - name: avg_frequency
    dtype: float64
  - name: min_frequency
    dtype: int64
  - name: median_frequency
    dtype: float64
  - name: p25_frequency
    dtype: int64
  - name: p75_frequency
    dtype: int64
  - name: frequencies
    sequence: int64
  - name: nl_scores
    dtype: float32
  - name: huffman_coding_length
    dtype: float64
  - name: memorization_score
    dtype: float64
  splits:
  - name: memories_deduped_12b.43000
    num_bytes: 260439433
    num_examples: 358863
  - name: memories_deduped_12b.103000
    num_bytes: 866085351
    num_examples: 1195578
  - name: memories_deduped_12b.83000
    num_bytes: 617572913
    num_examples: 852068
  - name: memories_deduped_12b.63000
    num_bytes: 424250618
    num_examples: 585067
  - name: memories_duped_12b.123000
    num_bytes: 1430433551
    num_examples: 1996011
  - name: memories_duped_12b.103000
    num_bytes: 1082995470
    num_examples: 1510459
  - name: memories_deduped_12b.123000
    num_bytes: 1132764924
    num_examples: 1564055
  - name: memories_duped_12b.23000
    num_bytes: 142780133
    num_examples: 198175
  - name: memories_duped_12b.43000
    num_bytes: 318049401
    num_examples: 442253
  - name: memories_deduped_12b.23000
    num_bytes: 118788383
    num_examples: 163418
  - name: memories_duped_12b.63000
    num_bytes: 520315072
    num_examples: 724678
  - name: memories_duped_12b.83000
    num_bytes: 766824742
    num_examples: 1068501
  download_size: 3067878899
  dataset_size: 7681299991
configs:
- config_name: default
  data_files:
  - split: memories_deduped_12b.43000
    path: data/memories_deduped_12b.43000-*
  - split: memories_deduped_12b.103000
    path: data/memories_deduped_12b.103000-*
  - split: memories_deduped_12b.83000
    path: data/memories_deduped_12b.83000-*
  - split: memories_deduped_12b.63000
    path: data/memories_deduped_12b.63000-*
  - split: memories_duped_12b.123000
    path: data/memories_duped_12b.123000-*
  - split: memories_duped_12b.103000
    path: data/memories_duped_12b.103000-*
  - split: memories_deduped_12b.123000
    path: data/memories_deduped_12b.123000-*
  - split: memories_duped_12b.23000
    path: data/memories_duped_12b.23000-*
  - split: memories_duped_12b.43000
    path: data/memories_duped_12b.43000-*
  - split: memories_deduped_12b.23000
    path: data/memories_deduped_12b.23000-*
  - split: memories_duped_12b.63000
    path: data/memories_duped_12b.63000-*
  - split: memories_duped_12b.83000
    path: data/memories_duped_12b.83000-*
---

Provider:
usvsnsp
Summary of original information

Dataset overview

Dataset features

  • sequence_id: int64
  • tokens: sequence of int64
  • text: string
  • is_incrementing: bool
  • is_repeating: bool
  • sequence_duplicates: int64
  • max_frequency: int64
  • avg_frequency: float64
  • min_frequency: int64
  • median_frequency: float64
  • p25_frequency: int64
  • p75_frequency: int64
  • frequencies: sequence of int64
  • nl_scores: float32
  • huffman_coding_length: float64
  • memorization_score: float64
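
The feature list above can be mirrored as a small schema check. The following is a minimal sketch, assuming that int64 maps to Python `int`, float32/float64 to `float`, and sequences to `list`. `SCHEMA`, `validate_record`, and `SAMPLE_RECORD` are hypothetical names, and the sample values are made up for illustration rather than taken from the dataset.

```python
# Field names and dtypes are copied from the feature list; the mapping to
# Python types is an assumption made for this illustration.
SCHEMA = {
    "sequence_id": int,
    "tokens": list,              # sequence of int64
    "text": str,
    "is_incrementing": bool,
    "is_repeating": bool,
    "sequence_duplicates": int,
    "max_frequency": int,
    "avg_frequency": float,
    "min_frequency": int,
    "median_frequency": float,
    "p25_frequency": int,
    "p75_frequency": int,
    "frequencies": list,         # sequence of int64
    "nl_scores": float,
    "huffman_coding_length": float,
    "memorization_score": float,
}

INT_SEQUENCES = ("tokens", "frequencies")


def _is_plain_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_record(record):
    """Return a list of problems found in `record`; empty means it conforms."""
    problems = []
    for name, expected in SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if expected is int and isinstance(value, bool):
            problems.append(f"{name}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif name in INT_SEQUENCES and not all(_is_plain_int(v) for v in value):
            problems.append(f"{name}: expected a list of ints")
    for name in record:
        if name not in SCHEMA:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical record: every value here is invented for the example.
SAMPLE_RECORD = {
    "sequence_id": 1,
    "tokens": [1, 2, 3],
    "text": "example text",
    "is_incrementing": False,
    "is_repeating": False,
    "sequence_duplicates": 1,
    "max_frequency": 30,
    "avg_frequency": 20.0,
    "min_frequency": 10,
    "median_frequency": 20.0,
    "p25_frequency": 15,
    "p75_frequency": 25,
    "frequencies": [10, 20, 30],
    "nl_scores": 0.5,
    "huffman_coding_length": 12.0,
    "memorization_score": 1.0,
}

print(validate_record(SAMPLE_RECORD))
```

The check is deliberately strict: an integer supplied for a float field is reported as a mismatch. Loosen that rule if your loader returns whole-number floats as `int`.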

Dataset splits

  • memories_deduped_12b.43000: 260439433 bytes, 358863 examples
  • memories_deduped_12b.103000: 866085351 bytes, 1195578 examples
  • memories_deduped_12b.83000: 617572913 bytes, 852068 examples
  • memories_deduped_12b.63000: 424250618 bytes, 585067 examples
  • memories_duped_12b.123000: 1430433551 bytes, 1996011 examples
  • memories_duped_12b.103000: 1082995470 bytes, 1510459 examples
  • memories_deduped_12b.123000: 1132764924 bytes, 1564055 examples
  • memories_duped_12b.23000: 142780133 bytes, 198175 examples
  • memories_duped_12b.43000: 318049401 bytes, 442253 examples
  • memories_deduped_12b.23000: 118788383 bytes, 163418 examples
  • memories_duped_12b.63000: 520315072 bytes, 724678 examples
  • memories_duped_12b.83000: 766824742 bytes, 1068501 examples
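
The split names follow a visible pattern, `memories_<variant>_<tag>.<number>`, and the splits are not listed in order. The sketch below parses the names and groups the example counts by variant, sorted by the numeric suffix. The listing does not say what the variant, the `12b` tag, or the numeric suffix denote, so the code treats them as opaque labels. `parse_split` and `examples_by_variant` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Example counts per split, copied from the list above.
NUM_EXAMPLES = {
    "memories_deduped_12b.43000": 358863,
    "memories_deduped_12b.103000": 1195578,
    "memories_deduped_12b.83000": 852068,
    "memories_deduped_12b.63000": 585067,
    "memories_duped_12b.123000": 1996011,
    "memories_duped_12b.103000": 1510459,
    "memories_deduped_12b.123000": 1564055,
    "memories_duped_12b.23000": 198175,
    "memories_duped_12b.43000": 442253,
    "memories_deduped_12b.23000": 163418,
    "memories_duped_12b.63000": 724678,
    "memories_duped_12b.83000": 1068501,
}

# Assumed naming pattern, inferred only from the twelve names listed.
SPLIT_NAME = re.compile(r"^memories_(deduped|duped)_([0-9a-z]+)\.(\d+)$")


def parse_split(name):
    """Split a name into (variant, tag, numeric suffix)."""
    match = SPLIT_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized split name: {name!r}")
    variant, tag, suffix = match.groups()
    return variant, tag, int(suffix)


def examples_by_variant(num_examples):
    """Group example counts by variant, ordered by numeric suffix."""
    table = defaultdict(dict)
    for name, count in num_examples.items():
        variant, _tag, suffix = parse_split(name)
        table[variant][suffix] = count
    return {v: dict(sorted(rows.items())) for v, rows in table.items()}


table = examples_by_variant(NUM_EXAMPLES)
for suffix in table["duped"]:
    print(suffix, table["deduped"][suffix], table["duped"][suffix])
```

Sorted this way, the listed counts grow with the numeric suffix in both variants, and each `duped` split holds more examples than the `deduped` split with the same suffix.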

Dataset size

  • Download size: 3067878899 bytes
  • Dataset size: 7681299991 bytes

Configuration

  • Config name: default
    • Data files:
      • split: memories_deduped_12b.43000, path: data/memories_deduped_12b.43000-*
      • split: memories_deduped_12b.103000, path: data/memories_deduped_12b.103000-*
      • split: memories_deduped_12b.83000, path: data/memories_deduped_12b.83000-*
      • split: memories_deduped_12b.63000, path: data/memories_deduped_12b.63000-*
      • split: memories_duped_12b.123000, path: data/memories_duped_12b.123000-*
      • split: memories_duped_12b.103000, path: data/memories_duped_12b.103000-*
      • split: memories_deduped_12b.123000, path: data/memories_deduped_12b.123000-*
      • split: memories_duped_12b.23000, path: data/memories_duped_12b.23000-*
      • split: memories_duped_12b.43000, path: data/memories_duped_12b.43000-*
      • split: memories_deduped_12b.23000, path: data/memories_deduped_12b.23000-*
      • split: memories_duped_12b.63000, path: data/memories_duped_12b.63000-*
      • split: memories_duped_12b.83000, path: data/memories_duped_12b.83000-*
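
Each split in the configuration maps to a glob of the form `data/<split>-*`. The sketch below shows how such a pattern selects files and how one split might be loaded. It assumes the Hugging Face `datasets` library is installed and the repository is reachable. `split_pattern`, `files_for_split`, and `load_split` are hypothetical helpers, and the file names in the example are invented, not actual shard names.

```python
from fnmatch import fnmatchcase

REPO_ID = "usvsnsp/generation-semantic-intermediate-filters"


def split_pattern(split):
    """Build the data-file glob used by the default config for a split."""
    return f"data/{split}-*"


def files_for_split(split, filenames):
    """Return the file names that the split's glob would select."""
    pattern = split_pattern(split)
    return [name for name in filenames if fnmatchcase(name, pattern)]


def load_split(split, streaming=True):
    """Load one split; streaming avoids downloading every split up front."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


# Hypothetical file names, used only to show how the globs tell splits apart.
example_files = [
    "data/memories_duped_12b.23000-00000-of-00001.parquet",
    "data/memories_duped_12b.123000-00000-of-00003.parquet",
    "data/memories_deduped_12b.23000-00000-of-00001.parquet",
]
print(files_for_split("memories_duped_12b.23000", example_files))
```

Because the glob is anchored at the full split name followed by a hyphen, `memories_duped_12b.23000` does not pick up files belonging to `memories_duped_12b.123000` or `memories_deduped_12b.23000`. A call such as `load_split("memories_deduped_12b.23000")` would need network access.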