
usvsnsp/semantic-filters

Source: Hugging Face. Updated 2024-06-02. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/usvsnsp/semantic-filters
Resource description:
Dataset card metadata (YAML front matter). It declares the 27 features, the 32 splits with their byte and example counts, and the download and dataset sizes that are itemized in the summary that follows. It also defines a single configuration, default, in which each split reads its data files from data/<split>-* (for example, split pile_duped_6.9b uses data/pile_duped_6.9b-*).
Provider:
usvsnsp
Summary of original information

Dataset features

  • sequence_id: integer (int64)
  • loss: float (float32)
  • prompt_perplexity: float (float32)
  • generation_perplexity: float (float32)
  • sequence_perplexity: float (float32)
  • text: string (string)
  • is_incrementing: boolean (bool)
  • is_repeating: boolean (bool)
  • sequence_duplicates: integer (int64)
  • max_frequency: integer (int64)
  • avg_frequency: float (float64)
  • min_frequency: integer (int64)
  • median_frequency: float (float64)
  • p25_frequency: integer (int64)
  • p75_frequency: integer (int64)
  • frequencies: sequence of integers (sequence: int64)
  • tokens: sequence of integers (sequence: int64)
  • repeating_offset: integer (int32)
  • num_repeating: integer (int32)
  • smallest_repeating_chunk: sequence of integers (sequence: int64)
  • nl_scores: float (float32)
  • 0_8_snowclones: integer (int64)
  • 0_9_snowclones: integer (int64)
  • 0_8_templates: integer (int64)
  • 0_9_templates: integer (int64)
  • huffman_coding_length: float (float64)
  • memorization_score: float (float64)
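
The feature list above can be turned into a small schema check. The following is a minimal sketch, not part of the dataset's own tooling: the names FEATURES and check_row and the "sequence<int64>" shorthand are assumptions introduced here for illustration. It only checks column names and basic Python types and says nothing about what each column means.

```python
# Feature names and dtypes copied from the feature list above.
# "sequence<int64>" is this sketch's own shorthand for "sequence: int64".
FEATURES = {
    "sequence_id": "int64",
    "loss": "float32",
    "prompt_perplexity": "float32",
    "generation_perplexity": "float32",
    "sequence_perplexity": "float32",
    "text": "string",
    "is_incrementing": "bool",
    "is_repeating": "bool",
    "sequence_duplicates": "int64",
    "max_frequency": "int64",
    "avg_frequency": "float64",
    "min_frequency": "int64",
    "median_frequency": "float64",
    "p25_frequency": "int64",
    "p75_frequency": "int64",
    "frequencies": "sequence<int64>",
    "tokens": "sequence<int64>",
    "repeating_offset": "int32",
    "num_repeating": "int32",
    "smallest_repeating_chunk": "sequence<int64>",
    "nl_scores": "float32",
    "0_8_snowclones": "int64",
    "0_9_snowclones": "int64",
    "0_8_templates": "int64",
    "0_9_templates": "int64",
    "huffman_coding_length": "float64",
    "memorization_score": "float64",
}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _matches(value, dtype):
    """Return True if a Python value is compatible with the listed dtype."""
    if dtype.startswith("int"):
        return _is_int(value)
    if dtype.startswith("float"):
        return isinstance(value, float) or _is_int(value)
    if dtype == "string":
        return isinstance(value, str)
    if dtype == "bool":
        return isinstance(value, bool)
    if dtype == "sequence<int64>":
        return isinstance(value, (list, tuple)) and all(_is_int(v) for v in value)
    return False


def check_row(row):
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = [f"missing column: {name}" for name in FEATURES if name not in row]
    problems += [f"unexpected column: {name}" for name in row if name not in FEATURES]
    problems += [
        f"{name}: expected {dtype}, got {type(row[name]).__name__}"
        for name, dtype in FEATURES.items()
        if name in row and not _matches(row[name], dtype)
    ]
    return problems
```

Rows are handled as plain dictionaries because four column names (0_8_snowclones, 0_9_snowclones, 0_8_templates, 0_9_templates) start with a digit and so cannot be used as Python attribute names.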

Dataset splits

  • pile_duped_6.9b: 5000000 examples, 7163973430 bytes
  • memories_duped_1b: 1256144 examples, 1765898301 bytes
  • pile_deduped_1.4b: 5000000 examples, 7174034756 bytes
  • memories_duped_6.9b: 2120976 examples, 2966172012 bytes
  • pile_duped_2.8b: 5000000 examples, 7163973430 bytes
  • memories_deduped_410m: 811040 examples, 1152062199 bytes
  • memories_deduped_1.4b: 1048104 examples, 1484016295 bytes
  • memories_duped_12b: 2382328 examples, 3329181971 bytes
  • memories_deduped_6.9b: 1680296 examples, 2373022100 bytes
  • pile_duped_160m: 5000000 examples, 7166473430 bytes
  • memories_deduped_1b: 1032872 examples, 1463590448 bytes
  • memories_duped_160m: 689680 examples, 978415336 bytes
  • pile_deduped_410m: 5000000 examples, 7174034756 bytes
  • pile_deduped_2.8b: 5000000 examples, 7174034756 bytes
  • pile_deduped_160m: 5000000 examples, 7174034756 bytes
  • memories_duped_1.4b: 1373728 examples, 1928658249 bytes
  • pile_duped_1.4b: 5000000 examples, 7163973430 bytes
  • memories_duped_70m: 463960 examples, 663287725 bytes
  • pile_duped_1b: 5000000 examples, 7163973430 bytes
  • memories_duped_410m: 970344 examples, 1368732311 bytes
  • pile_deduped_6.9b: 5000000 examples, 7174034756 bytes
  • pile_deduped_1b: 5000000 examples, 7174034756 bytes
  • pile_duped_410m: 5000000 examples, 7163973430 bytes
  • memories_deduped_70m: 411448 examples, 589429327 bytes
  • memories_deduped_2.8b: 1355216 examples, 1915450723 bytes
  • pile_duped_12b: 5000000 examples, 7163973430 bytes
  • pile_deduped_12b: 5000000 examples, 7174034756 bytes
  • memories_duped_2.8b: 1675080 examples, 2346559140 bytes
  • memories_deduped_160m: 581200 examples, 828223119 bytes
  • pile_duped_70m: 5000000 examples, 7166473430 bytes
  • pile_deduped_70m: 5000000 examples, 7174034756 bytes
  • memories_deduped_12b: 1871216 examples, 2641462250 bytes
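
The 32 split names listed above all follow one pattern: a prefix (pile or memories), a variant (duped or deduped), and a size suffix. The sketch below enumerates them and shows one way a single split might be streamed instead of downloaded in full. The helper names split_names, data_glob and stream_split are hypothetical; the decomposition into prefix, variant and size is an observation about the names, not something the listing states; and stream_split assumes the third-party Hugging Face datasets package is installed and that the repository is reachable.

```python
from itertools import product

REPO_ID = "usvsnsp/semantic-filters"

# Components observed in the split names listed above.
PREFIXES = ("pile", "memories")
VARIANTS = ("duped", "deduped")
SIZES = ("70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b")


def split_names():
    """Return all 32 split names, e.g. 'pile_duped_70m'."""
    return [f"{p}_{v}_{s}" for p, v, s in product(PREFIXES, VARIANTS, SIZES)]


def data_glob(split):
    """Data-file pattern for a split, following the data/<split>-* layout
    given in the dataset card metadata."""
    return f"data/{split}-*"


def stream_split(split):
    """Open one split lazily (assumes the `datasets` package is installed)."""
    if split not in split_names():
        raise ValueError(f"unknown split: {split!r}")
    # Imported here so the helpers above work without the dependency.
    from datasets import load_dataset

    # streaming=True iterates over records without fetching every file first.
    return load_dataset(REPO_ID, split=split, streaming=True)
```

Streaming is used here because each pile_* split alone is roughly 7 GB, so loading a full split eagerly may be impractical on a small machine.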

Dataset size

  • Download size: 59010780947 bytes
  • Dataset size: 142503226994 bytes
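
As a consistency check on the figures above, the per-split byte counts can be summed and compared with the reported dataset size. This is a small sketch using only the numbers listed on this page; the variable names are illustrative, and the ratio it prints is simply download size divided by dataset size, with no claim about how the files are stored.

```python
# Byte counts per split, copied from the split list on this page.
SPLIT_BYTES = {
    "pile_duped_6.9b": 7163973430,
    "memories_duped_1b": 1765898301,
    "pile_deduped_1.4b": 7174034756,
    "memories_duped_6.9b": 2966172012,
    "pile_duped_2.8b": 7163973430,
    "memories_deduped_410m": 1152062199,
    "memories_deduped_1.4b": 1484016295,
    "memories_duped_12b": 3329181971,
    "memories_deduped_6.9b": 2373022100,
    "pile_duped_160m": 7166473430,
    "memories_deduped_1b": 1463590448,
    "memories_duped_160m": 978415336,
    "pile_deduped_410m": 7174034756,
    "pile_deduped_2.8b": 7174034756,
    "pile_deduped_160m": 7174034756,
    "memories_duped_1.4b": 1928658249,
    "pile_duped_1.4b": 7163973430,
    "memories_duped_70m": 663287725,
    "pile_duped_1b": 7163973430,
    "memories_duped_410m": 1368732311,
    "pile_deduped_6.9b": 7174034756,
    "pile_deduped_1b": 7174034756,
    "pile_duped_410m": 7163973430,
    "memories_deduped_70m": 589429327,
    "memories_deduped_2.8b": 1915450723,
    "pile_duped_12b": 7163973430,
    "pile_deduped_12b": 7174034756,
    "memories_duped_2.8b": 2346559140,
    "memories_deduped_160m": 828223119,
    "pile_duped_70m": 7166473430,
    "pile_deduped_70m": 7174034756,
    "memories_deduped_12b": 2641462250,
}

DOWNLOAD_SIZE = 59010780947   # reported download size, in bytes
DATASET_SIZE = 142503226994   # reported dataset size, in bytes


def to_gib(num_bytes):
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


total_bytes = sum(SPLIT_BYTES.values())
ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"sum of splits : {total_bytes} bytes ({to_gib(total_bytes):.1f} GiB)")
print(f"matches report: {total_bytes == DATASET_SIZE}")
print(f"download/total: {ratio:.3f}")
```

The sum of the 32 split sizes equals the reported dataset size exactly, which suggests the listed figures are internally consistent.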