Isotonic/open-instruct-v1
Source: Hugging Face. Updated: 2023-08-31. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Isotonic/open-instruct-v1
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 693502500.8465096
    num_examples: 399050
  - name: test
    num_bytes: 173376494.1534904
    num_examples: 99763
  download_size: 369952246
  dataset_size: 866878995.0
task_categories:
- text-generation
- conversational
language:
- en
size_categories:
- 100K<n<1M
---
# Dataset Card for "open-instruct-v1"
Open Instruct V1 combines several instruction datasets, each cleaned and then converted into a single common format for training.
It uses Stability AI's system prompt:
```
### System: StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
```
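The metadata above lists four string columns and two splits, so loading the data could look like the following minimal sketch. It assumes the Hugging Face `datasets` library is installed and that the repository is still published under the ID shown in the download link. The helper names `load_open_instruct` and `missing_features` are hypothetical and exist only for illustration.

```python
# Column names taken from the dataset card's `features` list.
FEATURES = ("instruction", "input", "output", "text")


def load_open_instruct(split="train"):
    """Download one split ("train" or "test") from the Hugging Face Hub."""
    # Imported lazily so the helper below works without the library installed.
    from datasets import load_dataset

    return load_dataset("Isotonic/open-instruct-v1", split=split)


def missing_features(example):
    """Return the expected column names that are absent from one example."""
    return [name for name in FEATURES if name not in example]


# Example usage (requires network access):
# train = load_open_instruct("train")
# print(train.num_rows)
# print(missing_features(train[0]))
```

Checking the first record with `missing_features` is a cheap way to confirm that the downloaded schema matches the card before starting a training run.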
## Dataset Breakdown
| Dataset | Amount of Samples |
|----------------|-------------------|
| [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 51759 |
| [Self Instruct](https://github.com/yizhongw/self-instruct) | 82599 |
| [GPT-4 Instruct](https://github.com/teknium1/GPTeacher) | 18194 |
| [Code Alpaca](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) | 18019 |
| [Dolly](https://huggingface.co/datasets/HuggingFaceH4/databricks_dolly_15k) | 15015 |
| [Synthetic](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) | 33143 |
| [Roleplay](https://github.com/teknium1/GPTeacher) | 3146 |
| [asss](https://huggingface.co/datasets/HuggingFaceH4/asss) | 448 |
| [instruction-dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) | 327 |
| [Human assistant deduped](https://huggingface.co/datasets/Isotonic/human_assistant_conversation_deduped) | 209350 |
| Total | 432000 |
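
The total in the table can be checked with a few lines of arithmetic. The sketch below only re-adds the per-source counts listed above and, for comparison, the train and test split sizes from the card metadata. The dictionary keys are the table's row labels.

```python
# Per-source sample counts copied from the breakdown table.
breakdown = {
    "Alpaca": 51759,
    "Self Instruct": 82599,
    "GPT-4 Instruct": 18194,
    "Code Alpaca": 18019,
    "Dolly": 15015,
    "Synthetic": 33143,
    "Roleplay": 3146,
    "asss": 448,
    "instruction-dataset": 327,
    "Human assistant deduped": 209350,
}

# Sum of the table rows.
table_total = sum(breakdown.values())

# Sum of the train and test split sizes from the card metadata.
split_total = 399050 + 99763

print(table_total)
print(split_total)
```

The table rows add up to the stated total of 432000, while the two splits in the metadata add up to 498813 examples.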
Provider:
Isotonic
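Summary of original information
Dataset overview
Dataset name
- Name: open-instruct-v1
Dataset features
- Features:
  - instruction: string
  - input: string
  - output: string
  - text: string
Dataset splits
- Train:
  - Number of examples: 399050
  - Size: 693502500.8465096 bytes
- Test:
  - Number of examples: 99763
  - Size: 173376494.1534904 bytes
Dataset size
- Download size: 369952246 bytes
- Total dataset size: 866878995.0 bytes
Task categories
- Categories:
  - Text generation
  - Conversational
Language
- Language: English
Dataset scale
- Size category: 100K<n<1M
Dataset sources
- Sources:
  - Alpaca: 51759 samples
  - Self Instruct: 82599 samples
  - GPT-4 Instruct: 18194 samples
  - Code Alpaca: 18019 samples
  - Dolly: 15015 samples
  - Synthetic: 33143 samples
  - Roleplay: 3146 samples
  - asss: 448 samples
  - instruction-dataset: 327 samples
  - Human assistant deduped: 209350 samples
  - Total samples: 432000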



