Razvan27/dataset_paper_forge2024
Source: Hugging Face. Updated 2024-05-21; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/Razvan27/dataset_paper_forge2024
Resource description:
---
dataset_info:
- config_name: CodeClippyBig
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 2454648741
    num_examples: 2616788
  download_size: 486097130
  dataset_size: 2454648741
- config_name: CodeClippyComments
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 850
    num_examples: 10
  download_size: 1225
  dataset_size: 850
- config_name: CodeClippySmall
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 11394470618
    num_examples: 25149159
  download_size: 1040910863
  dataset_size: 11394470618
- config_name: CodeClippySmall4
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 23015283667
    num_examples: 25149159
  download_size: 2087605375
  dataset_size: 23015283667
- config_name: GithubCodeBig
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 1146535259
    num_examples: 802583
  download_size: 310420359
  dataset_size: 1146535259
- config_name: GithubCodeBig3
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  splits:
  - name: train
    num_bytes: 28059691580
    num_examples: 802583
  download_size: 591252394
  dataset_size: 28059691580
- config_name: GithubCodeSmall
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 21277961444
    num_examples: 45215655
  download_size: 6291164625
  dataset_size: 21277961444
- config_name: Test2
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 310
    num_examples: 5
  download_size: 1154
  dataset_size: 310
- config_name: TheStack2Big
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 2704660138
    num_examples: 2010245
  download_size: 622516824
  dataset_size: 2704660138
- config_name: TheStack2Small
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 35554187767
    num_examples: 77436338
  download_size: 9449610372
  dataset_size: 35554187767
- config_name: TheStackBig
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 16739
    num_examples: 11
  download_size: 6964
  dataset_size: 16739
- config_name: TheStackBig4
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 195228287251
    num_examples: 2010245
  download_size: 2485795225
  dataset_size: 195228287251
- config_name: TheStackSmall
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 122
    num_examples: 2
  download_size: 1519
  dataset_size: 122
- config_name: TheStackSmall4
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 70411448492
    num_examples: 77436338
  download_size: 18882362463
  dataset_size: 70411448492
- config_name: test
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 425
    num_examples: 5
  download_size: 1222
  dataset_size: 425
- config_name: test2
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 310
    num_examples: 5
  download_size: 1154
  dataset_size: 310
- config_name: test3
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 7804
    num_examples: 1
  download_size: 1152
  dataset_size: 7804
- config_name: test4
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 10443
    num_examples: 3
  download_size: 1630
  dataset_size: 10443
configs:
- config_name: CodeClippyBig
  data_files:
  - split: train
    path: data/CodeClippy_BigComments/train-*
- config_name: CodeClippyComments
  data_files:
  - split: train
    path: data/CodeClippy_Comments/train-*
- config_name: CodeClippySmall
  data_files:
  - split: train
    path: data/CodeClippy_SmallComments/train-*
- config_name: CodeClippySmall4
  data_files:
  - split: train
    path: data/CodeClippy4_SmallComments/train-*
- config_name: GithubCodeBig
  data_files:
  - split: train
    path: data/GithubCode_BigComments/train-*
- config_name: GithubCodeBig3
  data_files:
  - split: train
    path: data/GithubCode3_BigComments/train-*
- config_name: GithubCodeSmall
  data_files:
  - split: train
    path: data/GithubCode_SmallComments/train-*
- config_name: Test2
  data_files:
  - split: train
    path: data/Test2_trial/train-*
- config_name: TheStack2Big
  data_files:
  - split: train
    path: data/TheStack2_BigComments/train-*
- config_name: TheStack2Small
  data_files:
  - split: train
    path: data/TheStack2_SmallComments/train-*
- config_name: TheStackBig
  data_files:
  - split: train
    path: data/TheStack_BigComments/train-*
- config_name: TheStackBig4
  data_files:
  - split: train
    path: data/TheStack4_BigComments/train-*
- config_name: TheStackSmall
  data_files:
  - split: train
    path: data/TheStack_SmallComments/train-*
- config_name: TheStackSmall4
  data_files:
  - split: train
    path: data/TheStack4_SmallComments/train-*
- config_name: test
  data_files:
  - split: train
    path: data/RedPajama_Comments/train-*
- config_name: test2
  data_files:
  - split: train
    path: data/Test/train-*
- config_name: test3
  data_files:
  - split: train
    path: data/Test3/train-*
- config_name: test4
  data_files:
  - split: train
    path: data/Test4/train-*
---
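The card itself gives no usage instructions. As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the ID in the title, one config can be previewed in streaming mode so that the multi-gigabyte configs are not downloaded in full. The helper name `preview_config` is hypothetical and not part of the dataset.

```python
from itertools import islice

# Repository ID taken from the page title; config names come from the metadata.
REPO_ID = "Razvan27/dataset_paper_forge2024"


def preview_config(config_name, n=2):
    """Return the first n rows of one config's train split as a list of dicts."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset  # pip install datasets

    # streaming=True yields rows on demand instead of downloading every shard.
    stream = load_dataset(REPO_ID, config_name, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `preview_config("TheStackSmall")` would be a cheap first check, since the metadata lists that config with only 2 examples; each returned row should carry the `comments` and `size` columns declared for it.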
Provider:
Razvan27
Summary of original information
Dataset overview
Config name: CodeClippyBig
- Features:
comments: string
- Splits:
train: 2454648741 bytes, 2616788 examples
- Download size: 486097130 bytes
- Dataset size: 2454648741 bytes
Config name: CodeClippyComments
- Features:
comments: string
- Splits:
train: 850 bytes, 10 examples
- Download size: 1225 bytes
- Dataset size: 850 bytes
Config name: CodeClippySmall
- Features:
comments: string
- Splits:
train: 11394470618 bytes, 25149159 examples
- Download size: 1040910863 bytes
- Dataset size: 11394470618 bytes
Config name: CodeClippySmall4
- Features:
comments: string
original_comments: string
size: string
- Splits:
train: 23015283667 bytes, 25149159 examples
- Download size: 2087605375 bytes
- Dataset size: 23015283667 bytes
Config name: GithubCodeBig
- Features:
comments: string
- Splits:
train: 1146535259 bytes, 802583 examples
- Download size: 310420359 bytes
- Dataset size: 1146535259 bytes
Config name: GithubCodeBig3
- Features:
comments: string
original_comments: string
- Splits:
train: 28059691580 bytes, 802583 examples
- Download size: 591252394 bytes
- Dataset size: 28059691580 bytes
Config name: GithubCodeSmall
- Features:
comments: string
- Splits:
train: 21277961444 bytes, 45215655 examples
- Download size: 6291164625 bytes
- Dataset size: 21277961444 bytes
Config name: Test2
- Features:
comments: string
- Splits:
train: 310 bytes, 5 examples
- Download size: 1154 bytes
- Dataset size: 310 bytes
Config name: TheStack2Big
- Features:
comments: string
size: string
- Splits:
train: 2704660138 bytes, 2010245 examples
- Download size: 622516824 bytes
- Dataset size: 2704660138 bytes
Config name: TheStack2Small
- Features:
comments: string
size: string
- Splits:
train: 35554187767 bytes, 77436338 examples
- Download size: 9449610372 bytes
- Dataset size: 35554187767 bytes
Config name: TheStackBig
- Features:
comments: string
size: string
- Splits:
train: 16739 bytes, 11 examples
- Download size: 6964 bytes
- Dataset size: 16739 bytes
Config name: TheStackBig4
- Features:
comments: string
original_comments: string
size: string
- Splits:
train: 195228287251 bytes, 2010245 examples
- Download size: 2485795225 bytes
- Dataset size: 195228287251 bytes
Config name: TheStackSmall
- Features:
comments: string
size: string
- Splits:
train: 122 bytes, 2 examples
- Download size: 1519 bytes
- Dataset size: 122 bytes
Config name: TheStackSmall4
- Features:
comments: string
original_comments: string
size: string
- Splits:
train: 70411448492 bytes, 77436338 examples
- Download size: 18882362463 bytes
- Dataset size: 70411448492 bytes
Config name: test
- Features:
comments: string
- Splits:
train: 425 bytes, 5 examples
- Download size: 1222 bytes
- Dataset size: 425 bytes
Config name: test2
- Features:
comments: string
- Splits:
train: 310 bytes, 5 examples
- Download size: 1154 bytes
- Dataset size: 310 bytes
Config name: test3
- Features:
comments: string
- Splits:
train: 7804 bytes, 1 example
- Download size: 1152 bytes
- Dataset size: 7804 bytes
Config name: test4
- Features:
comments: string
- Splits:
train: 10443 bytes, 3 examples
- Download size: 1630 bytes
- Dataset size: 10443 bytes
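The raw byte counts above are hard to compare at a glance. The following sketch derives two figures from them for a few of the larger configs: the average size of one example and the ratio of dataset size to download size. The numbers are copied from the metadata; the `summarize` helper and the choice of configs are illustrative assumptions, not part of the dataset.

```python
# (config name, num_examples, dataset_size in bytes, download_size in bytes),
# copied from the metadata listed on this page.
CONFIGS = [
    ("CodeClippyBig", 2616788, 2454648741, 486097130),
    ("CodeClippySmall", 25149159, 11394470618, 1040910863),
    ("GithubCodeBig", 802583, 1146535259, 310420359),
    ("GithubCodeSmall", 45215655, 21277961444, 6291164625),
    ("TheStack2Big", 2010245, 2704660138, 622516824),
    ("TheStack2Small", 77436338, 35554187767, 9449610372),
]


def summarize(name, num_examples, dataset_size, download_size):
    """Derive per-example size and dataset-to-download size ratio for one config."""
    return {
        "config": name,
        "bytes_per_example": dataset_size / num_examples,
        "size_ratio": dataset_size / download_size,
    }


for row in CONFIGS:
    s = summarize(*row)
    print(f"{s['config']:<16} {s['bytes_per_example']:>10.1f} B/example "
          f"{s['size_ratio']:>6.2f}x dataset/download")
```

The same helper works for any other config on this page by passing its example count and the two sizes directly.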



