
manu/code_5p_data_separate

Source: Hugging Face · updated 2023-09-30 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/manu/code_5p_data_separate
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: StarcoderdataPythonTrain
    path: data/StarcoderdataPythonTrain-*
  - split: StarcoderdataPythonTest
    path: data/StarcoderdataPythonTest-*
  - split: StarcoderdataMarkdownTrain
    path: data/StarcoderdataMarkdownTrain-*
  - split: StarcoderdataMarkdownTest
    path: data/StarcoderdataMarkdownTest-*
  - split: StarcoderdataJupyterScriptsDedupFilteredTrain
    path: data/StarcoderdataJupyterScriptsDedupFilteredTrain-*
  - split: StarcoderdataJupyterScriptsDedupFilteredTest
    path: data/StarcoderdataJupyterScriptsDedupFilteredTest-*
  - split: StarcoderdataJupyterStructuredCleanDedupTrain
    path: data/StarcoderdataJupyterStructuredCleanDedupTrain-*
  - split: StarcoderdataJupyterStructuredCleanDedupTest
    path: data/StarcoderdataJupyterStructuredCleanDedupTest-*
  - split: StarcoderdataJsonTrain
    path: data/StarcoderdataJsonTrain-*
  - split: StarcoderdataJsonTest
    path: data/StarcoderdataJsonTest-*
  - split: CodeContestsTrain
    path: data/CodeContestsTrain-*
  - split: CodeContestsTest
    path: data/CodeContestsTest-*
  - split: PypiCleanTrain
    path: data/PypiCleanTrain-*
  - split: PypiCleanTest
    path: data/PypiCleanTest-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: dataset_id
    dtype: string
  splits:
  - name: StarcoderdataPythonTrain
    num_bytes: 3077290405
    num_examples: 643232
  - name: StarcoderdataPythonTest
    num_bytes: 546326
    num_examples: 100
  - name: StarcoderdataMarkdownTrain
    num_bytes: 4054448273
    num_examples: 1051364
  - name: StarcoderdataMarkdownTest
    num_bytes: 680799
    num_examples: 100
  - name: StarcoderdataJupyterScriptsDedupFilteredTrain
    num_bytes: 401590417
    num_examples: 45626
  - name: StarcoderdataJupyterScriptsDedupFilteredTest
    num_bytes: 724111
    num_examples: 100
  - name: StarcoderdataJupyterStructuredCleanDedupTrain
    num_bytes: 316718609
    num_examples: 33337
  - name: StarcoderdataJupyterStructuredCleanDedupTest
    num_bytes: 971655
    num_examples: 100
  - name: StarcoderdataJsonTrain
    num_bytes: 291208312
    num_examples: 237477
  - name: StarcoderdataJsonTest
    num_bytes: 112941
    num_examples: 100
  - name: CodeContestsTrain
    num_bytes: 151487748
    num_examples: 78717
  - name: CodeContestsTest
    num_bytes: 79396
    num_examples: 42
  - name: PypiCleanTrain
    num_bytes: 1549670299
    num_examples: 121809
  - name: PypiCleanTest
    num_bytes: 1718599
    num_examples: 100
  download_size: 4213817063
  dataset_size: 9847247890
---
# Dataset Card for "code_5p_data_separate"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
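
Each `data_files` entry in the card pairs a split name with a glob pattern of the form `data/<SplitName>-*`. The following is a minimal sketch of how such patterns could be resolved back to a split, using only the Python standard library. The shard file name in the example is hypothetical (the card gives only the patterns, not the actual file names), and `pattern_for` and `split_of` are illustrative helper names, not part of any library.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Split names copied from the `configs` section of the dataset card.
SPLITS = [
    "StarcoderdataPythonTrain", "StarcoderdataPythonTest",
    "StarcoderdataMarkdownTrain", "StarcoderdataMarkdownTest",
    "StarcoderdataJupyterScriptsDedupFilteredTrain",
    "StarcoderdataJupyterScriptsDedupFilteredTest",
    "StarcoderdataJupyterStructuredCleanDedupTrain",
    "StarcoderdataJupyterStructuredCleanDedupTest",
    "StarcoderdataJsonTrain", "StarcoderdataJsonTest",
    "CodeContestsTrain", "CodeContestsTest",
    "PypiCleanTrain", "PypiCleanTest",
]


def pattern_for(split: str) -> str:
    """Return the glob pattern the card declares for a split."""
    return f"data/{split}-*"


def split_of(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if none does."""
    for split in SPLITS:
        if fnmatchcase(path, pattern_for(split)):
            return split
    return None


# Hypothetical shard name, for illustration only.
print(split_of("data/CodeContestsTrain-00000-of-00001.parquet"))  # CodeContestsTrain
```

Because every pattern ends in `-*` directly after the split name, a path such as `data/CodeContestsTrain-...` cannot also match the `CodeContestsTest` pattern.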
Provider:
manu
Summary of original information

Dataset overview

Configuration

  • Default configuration
    • Data files
      • StarcoderdataPythonTrain: path data/StarcoderdataPythonTrain-*
      • StarcoderdataPythonTest: path data/StarcoderdataPythonTest-*
      • StarcoderdataMarkdownTrain: path data/StarcoderdataMarkdownTrain-*
      • StarcoderdataMarkdownTest: path data/StarcoderdataMarkdownTest-*
      • StarcoderdataJupyterScriptsDedupFilteredTrain: path data/StarcoderdataJupyterScriptsDedupFilteredTrain-*
      • StarcoderdataJupyterScriptsDedupFilteredTest: path data/StarcoderdataJupyterScriptsDedupFilteredTest-*
      • StarcoderdataJupyterStructuredCleanDedupTrain: path data/StarcoderdataJupyterStructuredCleanDedupTrain-*
      • StarcoderdataJupyterStructuredCleanDedupTest: path data/StarcoderdataJupyterStructuredCleanDedupTest-*
      • StarcoderdataJsonTrain: path data/StarcoderdataJsonTrain-*
      • StarcoderdataJsonTest: path data/StarcoderdataJsonTest-*
      • CodeContestsTrain: path data/CodeContestsTrain-*
      • CodeContestsTest: path data/CodeContestsTest-*
      • PypiCleanTrain: path data/PypiCleanTrain-*
      • PypiCleanTest: path data/PypiCleanTest-*
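
Since the default configuration exposes each source as a separate Train/Test split, a single split can be requested by name. The sketch below assumes the third-party `datasets` package is installed and that the repository is still available on the Hugging Face Hub; `split_name` and `load_split` are hypothetical helper names introduced for illustration. The import is deferred so that the helpers can be defined without the package present.

```python
REPO_ID = "manu/code_5p_data_separate"

# The seven sources; each has a "Train" and a "Test" split in the card.
SOURCES = [
    "StarcoderdataPython",
    "StarcoderdataMarkdown",
    "StarcoderdataJupyterScriptsDedupFiltered",
    "StarcoderdataJupyterStructuredCleanDedup",
    "StarcoderdataJson",
    "CodeContests",
    "PypiClean",
]


def split_name(source: str, part: str) -> str:
    """Build a split name such as 'CodeContestsTest' from a source and a part."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    if part not in ("Train", "Test"):
        raise ValueError(f"part must be 'Train' or 'Test', got {part!r}")
    return source + part


def load_split(source: str, part: str, streaming: bool = True):
    """Load one split; streaming avoids downloading the multi-gigabyte splits up front."""
    # Assumption: `datasets` is installed (pip install datasets) and the Hub is reachable.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_name(source, part), streaming=streaming)
```

With the package installed and network access, `next(iter(load_split("CodeContests", "Test")))` should return one record; per the card's feature list, each record is expected to carry the string fields `id`, `text`, and `dataset_id`.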

Dataset information

  • Features

    • id: string
    • text: string
    • dataset_id: string
  • Splits

    • StarcoderdataPythonTrain: 3077290405 bytes, 643232 examples
    • StarcoderdataPythonTest: 546326 bytes, 100 examples
    • StarcoderdataMarkdownTrain: 4054448273 bytes, 1051364 examples
    • StarcoderdataMarkdownTest: 680799 bytes, 100 examples
    • StarcoderdataJupyterScriptsDedupFilteredTrain: 401590417 bytes, 45626 examples
    • StarcoderdataJupyterScriptsDedupFilteredTest: 724111 bytes, 100 examples
    • StarcoderdataJupyterStructuredCleanDedupTrain: 316718609 bytes, 33337 examples
    • StarcoderdataJupyterStructuredCleanDedupTest: 971655 bytes, 100 examples
    • StarcoderdataJsonTrain: 291208312 bytes, 237477 examples
    • StarcoderdataJsonTest: 112941 bytes, 100 examples
    • CodeContestsTrain: 151487748 bytes, 78717 examples
    • CodeContestsTest: 79396 bytes, 42 examples
    • PypiCleanTrain: 1549670299 bytes, 121809 examples
    • PypiCleanTest: 1718599 bytes, 100 examples
  • Download size: 4213817063 bytes

  • Dataset size: 9847247890 bytes
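
The per-split figures can be cross-checked against the reported totals. The sketch below hard-codes the numbers from the listing and confirms that the split byte counts sum to the stated dataset size; it also derives the total example count and the download-to-dataset size ratio. The helper `mean_bytes` is an illustrative name, and reading the ratio as a compression effect is an assumption, since the listing does not say how the two sizes are measured.

```python
# (num_bytes, num_examples) per split, copied from the listing.
SPLIT_STATS = {
    "StarcoderdataPythonTrain": (3077290405, 643232),
    "StarcoderdataPythonTest": (546326, 100),
    "StarcoderdataMarkdownTrain": (4054448273, 1051364),
    "StarcoderdataMarkdownTest": (680799, 100),
    "StarcoderdataJupyterScriptsDedupFilteredTrain": (401590417, 45626),
    "StarcoderdataJupyterScriptsDedupFilteredTest": (724111, 100),
    "StarcoderdataJupyterStructuredCleanDedupTrain": (316718609, 33337),
    "StarcoderdataJupyterStructuredCleanDedupTest": (971655, 100),
    "StarcoderdataJsonTrain": (291208312, 237477),
    "StarcoderdataJsonTest": (112941, 100),
    "CodeContestsTrain": (151487748, 78717),
    "CodeContestsTest": (79396, 42),
    "PypiCleanTrain": (1549670299, 121809),
    "PypiCleanTest": (1718599, 100),
}
DOWNLOAD_SIZE = 4213817063
DATASET_SIZE = 9847247890

total_bytes = sum(num_bytes for num_bytes, _ in SPLIT_STATS.values())
total_examples = sum(num_examples for _, num_examples in SPLIT_STATS.values())


def mean_bytes(split: str) -> float:
    """Average size of one example in a split, in bytes."""
    num_bytes, num_examples = SPLIT_STATS[split]
    return num_bytes / num_examples


print(total_bytes == DATASET_SIZE)            # True
print(total_examples)                         # 2212204
print(f"{DOWNLOAD_SIZE / DATASET_SIZE:.3f}")  # 0.428
```

The download is therefore roughly 43% of the dataset size, and every Test split except `CodeContestsTest` (42 examples) holds exactly 100 examples.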
