
codecomplete/base_dataset

Source: Hugging Face; updated 2023-11-07; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/codecomplete/base_dataset
Description:
This dataset is a 10% repo-sampled dataset for selected languages: we applied a repo sample rate of 10%. For example, with a sample rate of 10% we take 10% of all repos for a given language but include all files inside each sampled repo.

This was generated using our codecomplete/training/completions/datagen:

```bash
./launch.sh \
  --dataset-name bigcode/starcoderdata \
  --subset c,cpp,go,java,javascript,typescript,python,ruby,scala,sql \
  --sample-rate 0.01 \
  --hf-token <HF_TOKEN> \
  --output-dir /home/${USER}/data \
  --cache-dir /home/${USER}/hfcache \
  --output-name c-cpp-go-java-javascript-typescript-python-ruby-scala-sql-0.01 \
  --shuffle \
  --build
```

**Create the repository**

```bash
# Install git lfs to support large files
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
```

```bash
# Create the dataset repo
huggingface-cli repo create <your_dataset_name> --type dataset --organization codecomplete
```

e.g.

```bash
huggingface-cli repo create base_dataset --type dataset --organization codecomplete
```

**Clone the repository**

```bash
git lfs install
git clone https://huggingface.co/datasets/<your_organization_name>/<your_dataset_name>
# e.g.
git clone https://huggingface.co/datasets/codecomplete/base_dataset
```

**Prepare your files**

Create a descriptive README.md and check the dataset.json file.

```bash
cp /somewhere/base_dataset/*.json .
# Quote the pattern so git lfs receives it unexpanded by the shell
git lfs track "*.json"
git add .gitattributes
git add *.json
git add --all
```

**Upload your files**

```bash
git status
git commit -m "First version of the your_dataset_name dataset."
git push
```

**Verify dataset**

```python
from datasets import load_dataset

dataset = load_dataset("codecomplete/<your_dataset_name>")
print(dataset.num_rows)
```
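The repo-level sampling described at the top of this description (keep a fraction of repos, but every file inside each kept repo) can be illustrated with a minimal sketch. This is not the datagen implementation; the function name `sample_by_repo`, the `"repo"` field, and the seeded shuffle are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_by_repo(records, sample_rate, seed=0):
    """Return every record belonging to a random `sample_rate` fraction of repos."""
    # Group records by repository so a repo is kept or dropped as a unit.
    by_repo = defaultdict(list)
    for record in records:
        by_repo[record["repo"]].append(record)  # "repo" is a hypothetical field name

    repos = sorted(by_repo)  # sorted so a given seed always picks the same repos
    keep_count = round(len(repos) * sample_rate)
    kept = set(random.Random(seed).sample(repos, keep_count))

    # Emit all files of each kept repo and none of the others.
    return [record for repo in repos if repo in kept for record in by_repo[repo]]


# Illustrative input: 20 repos with 3 files each.
records = [
    {"repo": f"org/repo{i}", "path": f"src/file{j}.py"}
    for i in range(20)
    for j in range(3)
]
sampled = sample_by_repo(records, sample_rate=0.10)
print(len({r["repo"] for r in sampled}), len(sampled))
```

With 20 repos of three files each and a 10% rate, the sketch keeps 2 repos and all 6 of their files, which is the difference from file-level sampling: no repo is ever partially included.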
Provider:
codecomplete
Summary of original information

Dataset overview

Dataset generation

  • Generation tool: generated with codecomplete/training/completions/datagen.
  • Example command: `./launch.sh --dataset-name bigcode/starcoderdata --subset c,cpp,go,java,javascript,typescript,python,ruby,scala,sql --sample-rate 0.01 --hf-token <HF_TOKEN> --output-dir /home/${USER}/data --cache-dir /home/${USER}/hfcache --output-name c-cpp-go-java-javascript-typescript-python-ruby-scala-sql-0.01 --shuffle --build`
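In the example command, the `--output-name` value repeats the `--subset` languages and the `--sample-rate` value. A small helper can keep the three in sync when the language list or rate changes. The sketch below is an assumption-laden illustration: `build_launch_command` is a hypothetical helper, and the naming pattern (languages joined by `-`, then the rate) is inferred from the single example above, not from the tool itself. Only flags that appear in the example command are used.

```python
import shlex


def build_launch_command(languages, sample_rate, output_dir, cache_dir,
                         hf_token="<HF_TOKEN>"):
    """Assemble a launch.sh argument list with a consistent output name."""
    # Assumed naming pattern: languages joined by "-", followed by the rate.
    output_name = "-".join([*languages, str(sample_rate)])
    return [
        "./launch.sh",
        "--dataset-name", "bigcode/starcoderdata",
        "--subset", ",".join(languages),
        "--sample-rate", str(sample_rate),
        "--hf-token", hf_token,
        "--output-dir", output_dir,
        "--cache-dir", cache_dir,
        "--output-name", output_name,
        "--shuffle",
        "--build",
    ]


languages = ["c", "cpp", "go", "java", "javascript",
             "typescript", "python", "ruby", "scala", "sql"]
command = build_launch_command(languages, 0.01, "/tmp/data", "/tmp/hfcache")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The directories `/tmp/data` and `/tmp/hfcache` are placeholders; substitute the paths used on your machine.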

Dataset creation and upload

  • Install dependencies: `curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash`, then `sudo apt-get install git-lfs`

  • Create the repository: `huggingface-cli repo create <your_dataset_name> --type dataset --organization codecomplete`

    Example: `huggingface-cli repo create base_dataset --type dataset --organization codecomplete`

  • Clone the repository: `git lfs install`, then `git clone https://huggingface.co/datasets/<your_organization_name>/<your_dataset_name>`

    Example: `git clone https://huggingface.co/datasets/codecomplete/base_dataset`

  • Prepare files: `cp /somewhere/base_dataset/*.json .`, `git lfs track "*.json"`, `git add .gitattributes`, `git add *.json`, `git add --all`

  • Upload files: `git status`, `git commit -m "First version of the your_dataset_name dataset."`, `git push`
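Before pushing, it can be worth confirming that the JSON files are actually routed through Git LFS. `git lfs track` records each pattern in `.gitattributes` with a `filter=lfs` attribute, so a quick check is to parse that file. The sketch below assumes that standard format; `lfs_tracked_patterns` is a hypothetical helper, not part of Git or the Hugging Face tooling.

```python
def lfs_tracked_patterns(gitattributes_text):
    """Return the path patterns that .gitattributes routes through Git LFS."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Each entry is a pattern followed by whitespace-separated attributes.
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


# The first line is what `git lfs track "*.json"` writes; the second is a
# made-up non-LFS entry included for contrast.
example = "*.json filter=lfs diff=lfs merge=lfs -text\n*.md text\n"
print(lfs_tracked_patterns(example))
```

Inside the cloned repository, the same function can be applied to the real file, for example `lfs_tracked_patterns(open(".gitattributes").read())`, and the result checked for `*.json` before running `git push`.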

Dataset verification

  • Verification method: in Python, run `from datasets import load_dataset`, then `dataset = load_dataset("codecomplete/<your_dataset_name>")` and `print(dataset.num_rows)`
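Printing `num_rows` shows the counts but leaves the check to the reader. When `load_dataset` is called without a `split` argument, `num_rows` is a mapping from split name to row count, so the check can be automated. The following is a minimal sketch: `verify_num_rows` is a hypothetical helper, the requirement of a `train` split is an assumption, and the counts in the example are invented stand-ins for a real `dataset.num_rows`.

```python
def verify_num_rows(num_rows, required_splits=("train",)):
    """Return a list of problems found in a split-name -> row-count mapping."""
    problems = []
    for split in required_splits:
        if split not in num_rows:
            problems.append(f"missing split: {split}")
    for split, count in num_rows.items():
        if count == 0:
            problems.append(f"empty split: {split}")
    return problems


# Stand-in for `load_dataset(...).num_rows`; these counts are made up.
example_num_rows = {"train": 1000, "validation": 0}
print(verify_num_rows(example_num_rows))
```

An empty list means every required split exists and no split is empty. With the real dataset, pass `dataset.num_rows` in place of `example_num_rows`.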