
Razvan27/dataset_paper_forge2024

Source: Hugging Face. Updated 2024-05-21; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/Razvan27/dataset_paper_forge2024
Resource description:
---
dataset_info:
- config_name: CodeClippyBig
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 2454648741
    num_examples: 2616788
  download_size: 486097130
  dataset_size: 2454648741
- config_name: CodeClippyComments
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 850
    num_examples: 10
  download_size: 1225
  dataset_size: 850
- config_name: CodeClippySmall
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 11394470618
    num_examples: 25149159
  download_size: 1040910863
  dataset_size: 11394470618
- config_name: CodeClippySmall4
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 23015283667
    num_examples: 25149159
  download_size: 2087605375
  dataset_size: 23015283667
- config_name: GithubCodeBig
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 1146535259
    num_examples: 802583
  download_size: 310420359
  dataset_size: 1146535259
- config_name: GithubCodeBig3
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  splits:
  - name: train
    num_bytes: 28059691580
    num_examples: 802583
  download_size: 591252394
  dataset_size: 28059691580
- config_name: GithubCodeSmall
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 21277961444
    num_examples: 45215655
  download_size: 6291164625
  dataset_size: 21277961444
- config_name: Test2
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 310
    num_examples: 5
  download_size: 1154
  dataset_size: 310
- config_name: TheStack2Big
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 2704660138
    num_examples: 2010245
  download_size: 622516824
  dataset_size: 2704660138
- config_name: TheStack2Small
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 35554187767
    num_examples: 77436338
  download_size: 9449610372
  dataset_size: 35554187767
- config_name: TheStackBig
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 16739
    num_examples: 11
  download_size: 6964
  dataset_size: 16739
- config_name: TheStackBig4
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 195228287251
    num_examples: 2010245
  download_size: 2485795225
  dataset_size: 195228287251
- config_name: TheStackSmall
  features:
  - name: comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 122
    num_examples: 2
  download_size: 1519
  dataset_size: 122
- config_name: TheStackSmall4
  features:
  - name: comments
    dtype: string
  - name: original_comments
    dtype: string
  - name: size
    dtype: string
  splits:
  - name: train
    num_bytes: 70411448492
    num_examples: 77436338
  download_size: 18882362463
  dataset_size: 70411448492
- config_name: test
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 425
    num_examples: 5
  download_size: 1222
  dataset_size: 425
- config_name: test2
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 310
    num_examples: 5
  download_size: 1154
  dataset_size: 310
- config_name: test3
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 7804
    num_examples: 1
  download_size: 1152
  dataset_size: 7804
- config_name: test4
  features:
  - name: comments
    dtype: string
  splits:
  - name: train
    num_bytes: 10443
    num_examples: 3
  download_size: 1630
  dataset_size: 10443
configs:
- config_name: CodeClippyBig
  data_files:
  - split: train
    path: data/CodeClippy_BigComments/train-*
- config_name: CodeClippyComments
  data_files:
  - split: train
    path: data/CodeClippy_Comments/train-*
- config_name: CodeClippySmall
  data_files:
  - split: train
    path: data/CodeClippy_SmallComments/train-*
- config_name: CodeClippySmall4
  data_files:
  - split: train
    path: data/CodeClippy4_SmallComments/train-*
- config_name: GithubCodeBig
  data_files:
  - split: train
    path: data/GithubCode_BigComments/train-*
- config_name: GithubCodeBig3
  data_files:
  - split: train
    path: data/GithubCode3_BigComments/train-*
- config_name: GithubCodeSmall
  data_files:
  - split: train
    path: data/GithubCode_SmallComments/train-*
- config_name: Test2
  data_files:
  - split: train
    path: data/Test2_trial/train-*
- config_name: TheStack2Big
  data_files:
  - split: train
    path: data/TheStack2_BigComments/train-*
- config_name: TheStack2Small
  data_files:
  - split: train
    path: data/TheStack2_SmallComments/train-*
- config_name: TheStackBig
  data_files:
  - split: train
    path: data/TheStack_BigComments/train-*
- config_name: TheStackBig4
  data_files:
  - split: train
    path: data/TheStack4_BigComments/train-*
- config_name: TheStackSmall
  data_files:
  - split: train
    path: data/TheStack_SmallComments/train-*
- config_name: TheStackSmall4
  data_files:
  - split: train
    path: data/TheStack4_SmallComments/train-*
- config_name: test
  data_files:
  - split: train
    path: data/RedPajama_Comments/train-*
- config_name: test2
  data_files:
  - split: train
    path: data/Test/train-*
- config_name: test3
  data_files:
  - split: train
    path: data/Test3/train-*
- config_name: test4
  data_files:
  - split: train
    path: data/Test4/train-*
---
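The metadata above maps each config name to a single `train` split stored under a `data/.../train-*` pattern. A minimal sketch of how one config might be inspected with the Hugging Face `datasets` library follows. The repository id, config names, paths, and the `comments` column come from the listing; the function name `peek_comments` and the default sample count are illustrative assumptions, and the call needs network access plus an installed `datasets` package.

```python
# Repository id taken from the listing.
REPO_ID = "Razvan27/dataset_paper_forge2024"

# A few config -> data file patterns copied from the `configs` section.
CONFIG_PATHS = {
    "CodeClippyBig": "data/CodeClippy_BigComments/train-*",
    "GithubCodeBig3": "data/GithubCode3_BigComments/train-*",
    "TheStackBig4": "data/TheStack4_BigComments/train-*",
    "test": "data/RedPajama_Comments/train-*",
}


def peek_comments(config_name: str, n: int = 3) -> list[str]:
    """Stream the first n `comments` values of one config (hypothetical helper)."""
    if config_name not in CONFIG_PATHS:
        raise KeyError(f"unknown config: {config_name}")
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole split, which matters for
    # configs whose download_size runs to several gigabytes.
    stream = load_dataset(REPO_ID, config_name, split="train", streaming=True)
    return [row["comments"] for row in stream.take(n)]
```

Calling `peek_comments("test")` would be a cheap first check, since that config lists only 5 examples. To go through the mirror in the download link rather than the default endpoint, setting the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing the library is the usual approach, though that is worth confirming against the mirror's own instructions.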
Provider:
Razvan27
Summary of original information

Dataset overview

Config name: CodeClippyBig

  • Features
    • comments: string
  • Splits
    • train: 2454648741 bytes, 2616788 examples
  • Download size: 486097130 bytes
  • Dataset size: 2454648741 bytes

Config name: CodeClippyComments

  • Features
    • comments: string
  • Splits
    • train: 850 bytes, 10 examples
  • Download size: 1225 bytes
  • Dataset size: 850 bytes

Config name: CodeClippySmall

  • Features
    • comments: string
  • Splits
    • train: 11394470618 bytes, 25149159 examples
  • Download size: 1040910863 bytes
  • Dataset size: 11394470618 bytes

Config name: CodeClippySmall4

  • Features
    • comments: string
    • original_comments: string
    • size: string
  • Splits
    • train: 23015283667 bytes, 25149159 examples
  • Download size: 2087605375 bytes
  • Dataset size: 23015283667 bytes

Config name: GithubCodeBig

  • Features
    • comments: string
  • Splits
    • train: 1146535259 bytes, 802583 examples
  • Download size: 310420359 bytes
  • Dataset size: 1146535259 bytes

Config name: GithubCodeBig3

  • Features
    • comments: string
    • original_comments: string
  • Splits
    • train: 28059691580 bytes, 802583 examples
  • Download size: 591252394 bytes
  • Dataset size: 28059691580 bytes

Config name: GithubCodeSmall

  • Features
    • comments: string
  • Splits
    • train: 21277961444 bytes, 45215655 examples
  • Download size: 6291164625 bytes
  • Dataset size: 21277961444 bytes

Config name: Test2

  • Features
    • comments: string
  • Splits
    • train: 310 bytes, 5 examples
  • Download size: 1154 bytes
  • Dataset size: 310 bytes

Config name: TheStack2Big

  • Features
    • comments: string
    • size: string
  • Splits
    • train: 2704660138 bytes, 2010245 examples
  • Download size: 622516824 bytes
  • Dataset size: 2704660138 bytes

Config name: TheStack2Small

  • Features
    • comments: string
    • size: string
  • Splits
    • train: 35554187767 bytes, 77436338 examples
  • Download size: 9449610372 bytes
  • Dataset size: 35554187767 bytes

Config name: TheStackBig

  • Features
    • comments: string
    • size: string
  • Splits
    • train: 16739 bytes, 11 examples
  • Download size: 6964 bytes
  • Dataset size: 16739 bytes

Config name: TheStackBig4

  • Features
    • comments: string
    • original_comments: string
    • size: string
  • Splits
    • train: 195228287251 bytes, 2010245 examples
  • Download size: 2485795225 bytes
  • Dataset size: 195228287251 bytes

Config name: TheStackSmall

  • Features
    • comments: string
    • size: string
  • Splits
    • train: 122 bytes, 2 examples
  • Download size: 1519 bytes
  • Dataset size: 122 bytes

Config name: TheStackSmall4

  • Features
    • comments: string
    • original_comments: string
    • size: string
  • Splits
    • train: 70411448492 bytes, 77436338 examples
  • Download size: 18882362463 bytes
  • Dataset size: 70411448492 bytes

Config name: test

  • Features
    • comments: string
  • Splits
    • train: 425 bytes, 5 examples
  • Download size: 1222 bytes
  • Dataset size: 425 bytes

Config name: test2

  • Features
    • comments: string
  • Splits
    • train: 310 bytes, 5 examples
  • Download size: 1154 bytes
  • Dataset size: 310 bytes

Config name: test3

  • Features
    • comments: string
  • Splits
    • train: 7804 bytes, 1 example
  • Download size: 1152 bytes
  • Dataset size: 7804 bytes

Config name: test4

  • Features
    • comments: string
  • Splits
    • train: 10443 bytes, 3 examples
  • Download size: 1630 bytes
  • Dataset size: 10443 bytes
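
The byte and example counts listed above can be turned into two quick comparisons: average bytes per example, and the ratio of download size to dataset size. The sketch below does that arithmetic for a handful of configs. The numbers are copied from the listing; the names `STATS`, `bytes_per_example`, and `download_ratio` are illustrative and not part of the dataset.

```python
# (num_bytes, num_examples, download_size) copied from the listing above.
STATS = {
    "CodeClippyComments": (850, 10, 1225),
    "Test2": (310, 5, 1154),
    "CodeClippyBig": (2454648741, 2616788, 486097130),
    "TheStackBig4": (195228287251, 2010245, 2485795225),
}


def bytes_per_example(config_name: str) -> float:
    """Average size of one example in the train split, in bytes."""
    num_bytes, num_examples, _ = STATS[config_name]
    return num_bytes / num_examples


def download_ratio(config_name: str) -> float:
    """Download size divided by dataset size (below 1 means the files are smaller)."""
    num_bytes, _, download_size = STATS[config_name]
    return download_size / num_bytes


if __name__ == "__main__":
    for name in STATS:
        print(f"{name}: {bytes_per_example(name):.1f} B/example, "
              f"download/dataset = {download_ratio(name):.3f}")
```

For the tiny configs the ratio exceeds 1 (for example 1225 bytes downloaded for 850 bytes of data), presumably because fixed file overhead dominates at that scale, while the large configs download at a fraction of their listed dataset size.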