lucasmccabe-lmi/gpt4all_code
Source: Hugging Face. Updated: 2023-04-26. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/lucasmccabe-lmi/gpt4all_code
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 294812377.0
    num_examples: 93257
  download_size: 143503343
  dataset_size: 294812377.0
---
# Dataset Card for "gpt4all_code"
We provide a code-related subset of the original [nomic-ai/gpt4all-j-prompt-generations](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations#dataset-card-for-gpt4all-j-prompt-generations) dataset (v1.2-jazzy revision). The subset contains the records whose prompts were sourced from [pacovaldez/stackoverflow-questions](https://huggingface.co/datasets/pacovaldez/stackoverflow-questions) and that explicitly mention one of Python, Java, C++, SQL, Kotlin, PHP, Swift, MATLAB, TypeScript, Scala, HTML, CSS, Rust, or Perl. Output records are responses from OpenAI’s GPT-3.5-Turbo. Prompt/response pairs have been reformatted to fit the Alpaca format.
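As a rough illustration of what an Alpaca-format record looks like when turned into a training prompt, here is a minimal sketch. It assumes the prompt templates published by the Stanford Alpaca project; the card itself only says "Alpaca format" and does not confirm a template. The function name `build_prompt` and the sample record are hypothetical.

```python
# Templates as used by the Stanford Alpaca project (an assumption here:
# the dataset card only states that records follow the "Alpaca format").
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Render one record (instruction/input/output fields) as a prompt string.

    The "output" field is the target completion and is not part of the prompt.
    """
    extra_input = (record.get("input") or "").strip()
    if extra_input:
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=extra_input
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Hypothetical record, shaped like the dataset's three string features.
sample = {
    "instruction": "How do I reverse a string in Python?",
    "input": "",
    "output": "Use slicing: s[::-1].",
}
print(build_prompt(sample) + sample["output"])
```

With the Hugging Face `datasets` library installed, real records could be obtained with `load_dataset("lucasmccabe-lmi/gpt4all_code", split="train")` and passed to the same function.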
Numbers:
- **Prompts**: 93257
- **Tokens**: 87686551 using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer (counting instruction+input+output)
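
The figures above imply some per-record averages that can help when budgeting context length or storage. The following sketch only does arithmetic on the numbers stated in this card; the variable names are illustrative.

```python
# Figures copied from the dataset card.
NUM_EXAMPLES = 93257
NUM_TOKENS = 87686551        # gpt-neox-20b tokenizer, instruction+input+output
DATASET_BYTES = 294812377    # dataset_size
DOWNLOAD_BYTES = 143503343   # download_size

tokens_per_example = NUM_TOKENS / NUM_EXAMPLES
bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
bytes_per_token = DATASET_BYTES / NUM_TOKENS
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"mean tokens per example: {tokens_per_example:.1f}")
print(f"mean bytes per example:  {bytes_per_example:.1f}")
print(f"mean bytes per token:    {bytes_per_token:.2f}")
print(f"download / dataset size: {download_ratio:.3f}")
```

These work out to roughly 940 tokens and a little over 3 kB per record on average, with the download being a bit under half the size of the full dataset. Averages say nothing about the length distribution, so individual records may be far longer.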
Provider:
lucasmccabe-lmi
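Original information summary
Dataset overview
Dataset name
- Name: gpt4all_code
Dataset contents
- Features:
  - instruction: string
  - input: string
  - output: string
Dataset splits
- Train:
  - Number of examples: 93257
  - Data size: 294812377.0 bytes
Dataset size
- Download size: 143503343 bytes
- Total dataset size: 294812377.0 bytes
Dataset details
- Source: a subset of the original nomic-ai/gpt4all-j-prompt-generations dataset, restricted to records that mention a programming language.
- Programming languages: Python, Java, C++, SQL, Kotlin, PHP, Swift, MATLAB, Typescript, Scala, HTML, CSS, Rust, Perl
- Output records: responses from OpenAI’s GPT3.5-Turbo
- Format: Alpaca format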
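
The card does not say how "explicitly mentions" a language was decided. As a hedged sketch of one plausible approach, the snippet below does a case-insensitive keyword match with custom word boundaries. The matching rule, the function name `mentioned_languages`, and the sample prompts are all assumptions, not the authors' method.

```python
import re

# Language list taken from the dataset card.
LANGUAGES = [
    "Python", "Java", "C++", "SQL", "Kotlin", "PHP", "Swift",
    "MATLAB", "TypeScript", "Scala", "HTML", "CSS", "Rust", "Perl",
]


def _compile(name: str) -> re.Pattern:
    # Lookarounds replace \b because "C++" ends in non-word characters,
    # where \b would not behave as intended. They also stop "Java" from
    # matching inside "JavaScript" and "SQL" from matching inside "MySQL".
    return re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])",
        re.IGNORECASE,
    )


_PATTERNS = {name: _compile(name) for name in LANGUAGES}


def mentioned_languages(prompt: str) -> list[str]:
    """Return the listed languages that appear as standalone terms in prompt."""
    return [name for name, pat in _PATTERNS.items() if pat.search(prompt)]


# Hypothetical prompts for illustration.
print(mentioned_languages("Sorting a list in Python vs. C++"))
print(mentioned_languages("JavaScript and MySQL tips"))
```

A rule this simple has obvious gaps: it treats the adjective "swift" as the language, and it skips versioned spellings such as "C++11". A real filter would likely need more care.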
Dataset statistics
- Number of prompts: 93257
- Number of tokens: 87686551 (counted with the EleutherAI/gpt-neox-20b tokenizer)



