blastwind/random_code_snippets
Hugging Face | Updated 2024-03-17 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/blastwind/random_code_snippets
Description:
---
dataset_info:
  features:
  - name: lang
    dtype: string
  - name: seed
    dtype: string
  splits:
  - name: train
    num_bytes: 3114466
    num_examples: 10000
  download_size: 1629429
  dataset_size: 3114466
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
This dataset contains 10,000 random code snippets of 5-15 lines each, extracted from [`bigcode/starcoderdata`](https://huggingface.co/datasets/bigcode/starcoderdata).
I consider 10 languages: Haskell, Python, cpp, java, typescript, shell, csharp, rust, php, and swift. For each language, I collect 1,000 documents and extract 5-15 random lines from each document to build the dataset.
See MagiCoder and its [seed collection](https://github.com/ise-uiuc/magicoder/blob/main/experiments/collect_seed_documents.py#L35) process. In my use case, I needed some inspiration documents for generating synthetic datasets.
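The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset: the function name `sample_snippet`, its parameters, and the choice to take a contiguous window of lines are all assumptions made for the example.

```python
import random


def sample_snippet(document, min_lines=5, max_lines=15, rng=None):
    """Return a random run of consecutive lines from `document`.

    Hypothetical helper: the name, the parameters, and the contiguous-window
    choice are assumptions for illustration, not taken from the dataset's
    build script.
    """
    rng = rng or random.Random()
    lines = document.splitlines()
    if not lines:
        return ""
    # Clamp both bounds so documents shorter than min_lines still work.
    low = min(min_lines, len(lines))
    high = min(max_lines, len(lines))
    count = rng.randint(low, high)  # randint is inclusive on both ends
    start = rng.randint(0, len(lines) - count)
    return "\n".join(lines[start:start + count])


if __name__ == "__main__":
    doc = "\n".join(f"print({i})" for i in range(40))
    print(sample_snippet(doc, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the sampling reproducible, which may be useful when regenerating seeds for a synthetic-data pipeline.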
Provider:
blastwind
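Summary of original information
Dataset overview
Dataset features
- lang: string
- seed: string
Dataset splits
- Training split (train):
  - Number of examples: 10000
  - Size: 3114466 bytes
Dataset size
- Download size: 1629429 bytes
- Total dataset size: 3114466 bytes
Configuration
- Configuration name: default
- Data files:
  - Split: train
  - Path: data/train-*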
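
Given the `lang` and `seed` fields and the single `train` split listed above, one way to load the data and check the per-language balance might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository ID is `blastwind/random_code_snippets`; the helper names `load_rows` and `count_by_language` are hypothetical.

```python
from collections import Counter


def load_rows():
    """Load the train split (requires the `datasets` package and network access)."""
    # Imported lazily so the counting helper below works without the library.
    from datasets import load_dataset

    return load_dataset("blastwind/random_code_snippets", split="train")


def count_by_language(rows):
    """Count rows per value of the `lang` field (hypothetical helper)."""
    return Counter(row["lang"] for row in rows)


if __name__ == "__main__":
    counts = count_by_language(load_rows())
    for lang, n in sorted(counts.items()):
        print(f"{lang}: {n}")
```

If the card's description holds, each of the 10 languages should account for 1,000 rows; the exact label strings stored in `lang` are not specified here, so inspect the output rather than hard-coding them.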



