usvsnsp/semantic-duplicates
Source: Hugging Face. Updated 2024-02-26; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/usvsnsp/semantic-duplicates
Resource description:
---
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: 0.9_frequencies
    dtype: int64
  - name: 0.8_frequencies
    dtype: int64
  splits:
  - name: duped_2.8b_snowclones
    num_bytes: 40201848
    num_examples: 1675077
  - name: duped_6.9b_templates
    num_bytes: 50903256
    num_examples: 2120969
  - name: deduped_6.9b_templates
    num_bytes: 40327056
    num_examples: 1680294
  - name: deduped_1.4b_templates
    num_bytes: 25154328
    num_examples: 1048097
  - name: deduped_snowclones
    num_bytes: 120000000
    num_examples: 5000000
  - name: duped_1b_templates
    num_bytes: 30147384
    num_examples: 1256141
  - name: duped_12b_templates
    num_bytes: 57175824
    num_examples: 2382326
  - name: deduped_160m_snowclones
    num_bytes: 13948680
    num_examples: 581195
  - name: deduped_1b_snowclones
    num_bytes: 24788760
    num_examples: 1032865
  - name: duped_70m_snowclones
    num_bytes: 11134872
    num_examples: 463953
  - name: deduped_1.4b_snowclones
    num_bytes: 25154328
    num_examples: 1048097
  - name: duped_1.4b_templates
    num_bytes: 32969328
    num_examples: 1373722
  - name: deduped_1b_templates
    num_bytes: 24788760
    num_examples: 1032865
  - name: deduped_2.8b_templates
    num_bytes: 32525064
    num_examples: 1355211
  - name: duped_2.8b_templates
    num_bytes: 40201848
    num_examples: 1675077
  - name: duped_6.9b_snowclones
    num_bytes: 50903256
    num_examples: 2120969
  - name: duped_410m_snowclones
    num_bytes: 23288184
    num_examples: 970341
  - name: deduped_410m_templates
    num_bytes: 19464936
    num_examples: 811039
  - name: duped_410m_templates
    num_bytes: 23288184
    num_examples: 970341
  - name: deduped_160m_templates
    num_bytes: 13948680
    num_examples: 581195
  - name: deduped_70m_templates
    num_bytes: 9874752
    num_examples: 411448
  - name: duped_160m_templates
    num_bytes: 16552152
    num_examples: 689673
  - name: duped_12b_snowclones
    num_bytes: 57175824
    num_examples: 2382326
  - name: duped_snowclones
    num_bytes: 120000000
    num_examples: 5000000
  - name: deduped_2.8b_snowclones
    num_bytes: 32525064
    num_examples: 1355211
  - name: deduped_410m_snowclones
    num_bytes: 19464936
    num_examples: 811039
  - name: duped_160m_snowclones
    num_bytes: 16552152
    num_examples: 689673
  - name: deduped_6.9b_snowclones
    num_bytes: 40327056
    num_examples: 1680294
  - name: deduped_70m_snowclones
    num_bytes: 9874752
    num_examples: 411448
  - name: duped_1b_snowclones
    num_bytes: 30147384
    num_examples: 1256141
  - name: duped_1.4b_snowclones
    num_bytes: 32969328
    num_examples: 1373722
  - name: duped_70m_templates
    num_bytes: 11134872
    num_examples: 463953
  - name: duped_templates
    num_bytes: 120000000
    num_examples: 5000000
  - name: deduped_templates
    num_bytes: 120000000
    num_examples: 5000000
  - name: deduped_12b_templates
    num_bytes: 44909160
    num_examples: 1871215
  - name: deduped_12b_snowclones
    num_bytes: 44909160
    num_examples: 1871215
  download_size: 531300635
  dataset_size: 1516549488
configs:
- config_name: default
  data_files:
  - split: duped_2.8b_snowclones
    path: data/duped_2.8b_snowclones-*
  - split: duped_6.9b_templates
    path: data/duped_6.9b_templates-*
  - split: deduped_6.9b_templates
    path: data/deduped_6.9b_templates-*
  - split: deduped_1.4b_templates
    path: data/deduped_1.4b_templates-*
  - split: deduped_snowclones
    path: data/deduped_snowclones-*
  - split: duped_1b_templates
    path: data/duped_1b_templates-*
  - split: duped_12b_templates
    path: data/duped_12b_templates-*
  - split: deduped_160m_snowclones
    path: data/deduped_160m_snowclones-*
  - split: deduped_1b_snowclones
    path: data/deduped_1b_snowclones-*
  - split: duped_70m_snowclones
    path: data/duped_70m_snowclones-*
  - split: deduped_1.4b_snowclones
    path: data/deduped_1.4b_snowclones-*
  - split: duped_1.4b_templates
    path: data/duped_1.4b_templates-*
  - split: deduped_1b_templates
    path: data/deduped_1b_templates-*
  - split: deduped_2.8b_templates
    path: data/deduped_2.8b_templates-*
  - split: duped_2.8b_templates
    path: data/duped_2.8b_templates-*
  - split: duped_6.9b_snowclones
    path: data/duped_6.9b_snowclones-*
  - split: duped_410m_snowclones
    path: data/duped_410m_snowclones-*
  - split: deduped_410m_templates
    path: data/deduped_410m_templates-*
  - split: duped_410m_templates
    path: data/duped_410m_templates-*
  - split: deduped_160m_templates
    path: data/deduped_160m_templates-*
  - split: deduped_70m_templates
    path: data/deduped_70m_templates-*
  - split: duped_160m_templates
    path: data/duped_160m_templates-*
  - split: duped_12b_snowclones
    path: data/duped_12b_snowclones-*
  - split: duped_snowclones
    path: data/duped_snowclones-*
  - split: deduped_2.8b_snowclones
    path: data/deduped_2.8b_snowclones-*
  - split: deduped_410m_snowclones
    path: data/deduped_410m_snowclones-*
  - split: duped_160m_snowclones
    path: data/duped_160m_snowclones-*
  - split: deduped_6.9b_snowclones
    path: data/deduped_6.9b_snowclones-*
  - split: deduped_70m_snowclones
    path: data/deduped_70m_snowclones-*
  - split: duped_1b_snowclones
    path: data/duped_1b_snowclones-*
  - split: duped_1.4b_snowclones
    path: data/duped_1.4b_snowclones-*
  - split: duped_70m_templates
    path: data/duped_70m_templates-*
  - split: duped_templates
    path: data/duped_templates-*
  - split: deduped_templates
    path: data/deduped_templates-*
  - split: deduped_12b_templates
    path: data/deduped_12b_templates-*
  - split: deduped_12b_snowclones
    path: data/deduped_12b_snowclones-*
---
Provider:
usvsnsp
Summary of original information
Dataset overview
Features
- index: int64
- 0.9_frequencies: int64
- 0.8_frequencies: int64
Data splits
- duped_2.8b_snowclones: 40201848 bytes, 1675077 examples
- duped_6.9b_templates: 50903256 bytes, 2120969 examples
- deduped_6.9b_templates: 40327056 bytes, 1680294 examples
- deduped_1.4b_templates: 25154328 bytes, 1048097 examples
- deduped_snowclones: 120000000 bytes, 5000000 examples
- duped_1b_templates: 30147384 bytes, 1256141 examples
- duped_12b_templates: 57175824 bytes, 2382326 examples
- deduped_160m_snowclones: 13948680 bytes, 581195 examples
- deduped_1b_snowclones: 24788760 bytes, 1032865 examples
- duped_70m_snowclones: 11134872 bytes, 463953 examples
- deduped_1.4b_snowclones: 25154328 bytes, 1048097 examples
- duped_1.4b_templates: 32969328 bytes, 1373722 examples
- deduped_1b_templates: 24788760 bytes, 1032865 examples
- deduped_2.8b_templates: 32525064 bytes, 1355211 examples
- duped_2.8b_templates: 40201848 bytes, 1675077 examples
- duped_6.9b_snowclones: 50903256 bytes, 2120969 examples
- duped_410m_snowclones: 23288184 bytes, 970341 examples
- deduped_410m_templates: 19464936 bytes, 811039 examples
- duped_410m_templates: 23288184 bytes, 970341 examples
- deduped_160m_templates: 13948680 bytes, 581195 examples
- deduped_70m_templates: 9874752 bytes, 411448 examples
- duped_160m_templates: 16552152 bytes, 689673 examples
- duped_12b_snowclones: 57175824 bytes, 2382326 examples
- duped_snowclones: 120000000 bytes, 5000000 examples
- deduped_2.8b_snowclones: 32525064 bytes, 1355211 examples
- deduped_410m_snowclones: 19464936 bytes, 811039 examples
- duped_160m_snowclones: 16552152 bytes, 689673 examples
- deduped_6.9b_snowclones: 40327056 bytes, 1680294 examples
- deduped_70m_snowclones: 9874752 bytes, 411448 examples
- duped_1b_snowclones: 30147384 bytes, 1256141 examples
- duped_1.4b_snowclones: 32969328 bytes, 1373722 examples
- duped_70m_templates: 11134872 bytes, 463953 examples
- duped_templates: 120000000 bytes, 5000000 examples
- deduped_templates: 120000000 bytes, 5000000 examples
- deduped_12b_templates: 44909160 bytes, 1871215 examples
- deduped_12b_snowclones: 44909160 bytes, 1871215 examples
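The byte and example counts above appear to follow a fixed row width: three int64 columns at 8 bytes each would give 24 bytes per row. The sketch below checks that relationship against a handful of the listed splits. The helper name `rows_from_bytes` and the choice of sample splits are illustrative assumptions, not part of the dataset card.

```python
# Assumption: each row stores exactly the three int64 features and nothing else.
INT64_BYTES = 8
COLUMNS = ("index", "0.9_frequencies", "0.8_frequencies")
ROW_BYTES = INT64_BYTES * len(COLUMNS)  # 24 bytes per row

# (num_bytes, num_examples) copied from a few of the splits listed above.
SAMPLE_SPLITS = {
    "duped_2.8b_snowclones": (40201848, 1675077),
    "duped_6.9b_templates": (50903256, 2120969),
    "duped_70m_snowclones": (11134872, 463953),
    "deduped_70m_templates": (9874752, 411448),
    "deduped_12b_templates": (44909160, 1871215),
    "duped_templates": (120000000, 5000000),
}


def rows_from_bytes(num_bytes: int) -> int:
    """Return the row count implied by num_bytes at a fixed 24-byte row width."""
    rows, remainder = divmod(num_bytes, ROW_BYTES)
    if remainder:
        raise ValueError(f"{num_bytes} is not a multiple of {ROW_BYTES}")
    return rows


# Collect any sampled split whose listed example count disagrees with its size.
mismatches = [
    name
    for name, (num_bytes, num_examples) in SAMPLE_SPLITS.items()
    if rows_from_bytes(num_bytes) != num_examples
]
print("mismatching splits:", mismatches)
```

A check like this is a cheap way to spot a truncated or mislabeled split before downloading it.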
Dataset size
- Download size: 531300635 bytes
- Dataset size: 1516549488 bytes
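As a rough illustration of what the two sizes imply, the ratio of download size to dataset size can be computed directly. The variable names below are illustrative. Reading the ratio as a compression factor is an assumption, since the card does not say how the files are stored.

```python
# Sizes in bytes, copied from the figures above.
DOWNLOAD_SIZE = 531300635
DATASET_SIZE = 1516549488

# Fraction of the in-memory dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download is {download_ratio:.1%} of the dataset size")
print(f"download: {DOWNLOAD_SIZE / 1e6:.1f} MB, dataset: {DATASET_SIZE / 1e9:.2f} GB")
```

The download comes to roughly a third of the stated dataset size.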
Configuration
- Configuration name: default
- Data files (split: path):
- duped_2.8b_snowclones: data/duped_2.8b_snowclones-*
- duped_6.9b_templates: data/duped_6.9b_templates-*
- deduped_6.9b_templates: data/deduped_6.9b_templates-*
- deduped_1.4b_templates: data/deduped_1.4b_templates-*
- deduped_snowclones: data/deduped_snowclones-*
- duped_1b_templates: data/duped_1b_templates-*
- duped_12b_templates: data/duped_12b_templates-*
- deduped_160m_snowclones: data/deduped_160m_snowclones-*
- deduped_1b_snowclones: data/deduped_1b_snowclones-*
- duped_70m_snowclones: data/duped_70m_snowclones-*
- deduped_1.4b_snowclones: data/deduped_1.4b_snowclones-*
- duped_1.4b_templates: data/duped_1.4b_templates-*
- deduped_1b_templates: data/deduped_1b_templates-*
- deduped_2.8b_templates: data/deduped_2.8b_templates-*
- duped_2.8b_templates: data/duped_2.8b_templates-*
- duped_6.9b_snowclones: data/duped_6.9b_snowclones-*
- duped_410m_snowclones: data/duped_410m_snowclones-*
- deduped_410m_templates: data/deduped_410m_templates-*
- duped_410m_templates: data/duped_410m_templates-*
- deduped_160m_templates: data/deduped_160m_templates-*
- deduped_70m_templates: data/deduped_70m_templates-*
- duped_160m_templates: data/duped_160m_templates-*
- duped_12b_snowclones: data/duped_12b_snowclones-*
- duped_snowclones: data/duped_snowclones-*
- deduped_2.8b_snowclones: data/deduped_2.8b_snowclones-*
- deduped_410m_snowclones: data/deduped_410m_snowclones-*
- duped_160m_snowclones: data/duped_160m_snowclones-*
- deduped_6.9b_snowclones: data/deduped_6.9b_snowclones-*
- deduped_70m_snowclones: data/deduped_70m_snowclones-*
- duped_1b_snowclones: data/duped_1b_snowclones-*
- duped_1.4b_snowclones: data/duped_1.4b_snowclones-*
- duped_70m_templates: data/duped_70m_templates-*
- duped_templates: data/duped_templates-*
- deduped_templates: data/deduped_templates-*
- deduped_12b_templates: data/deduped_12b_templates-*
- deduped_12b_snowclones: data/deduped_12b_snowclones-*
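The split names and file paths above follow a regular pattern: a `duped` or `deduped` prefix, an optional size tag such as `70m` or `2.8b`, a `templates` or `snowclones` suffix, and a path of the form `data/<split>-*`. The sketch below parses that pattern and shows one way a single split might be loaded. `SplitInfo`, `parse_split`, `data_path`, and `load_split` are hypothetical helper names. The card does not say what the size tag refers to, so it is treated here as an opaque label. `load_split` assumes the third-party `datasets` library is installed and that network access is available.

```python
import re
from typing import NamedTuple, Optional


class SplitInfo(NamedTuple):
    variant: str         # "duped" or "deduped"
    size: Optional[str]  # size tag such as "2.8b"; None when the name has no tag
    kind: str            # "templates" or "snowclones"


# Pattern inferred from the split names in the listing (an assumption).
_SPLIT_RE = re.compile(r"^(duped|deduped)(?:_([0-9.]+[mb]))?_(templates|snowclones)$")


def parse_split(name: str) -> SplitInfo:
    """Break a split name into its variant, optional size tag, and kind."""
    match = _SPLIT_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized split name: {name!r}")
    return SplitInfo(*match.groups())


def data_path(name: str) -> str:
    """Return the file glob that the default config maps this split to."""
    return f"data/{name}-*"


def load_split(name: str):
    """Load one split from the Hugging Face Hub (needs `datasets` and network)."""
    parse_split(name)  # fail early on a mistyped split name
    from datasets import load_dataset  # imported lazily; third-party dependency
    return load_dataset("usvsnsp/semantic-duplicates", split=name)


print(parse_split("deduped_1.4b_templates"))
print(data_path("duped_70m_templates"))
```

Loading a single named split, for example `load_split("duped_70m_templates")`, avoids materializing all of the splits listed in the configuration.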



