Isotonic/open-instruct-v1

Name: Isotonic/open-instruct-v1
Creator: Isotonic
Published: 2023-08-31 07:33:25
License: No description available

Source: Hugging Face. Updated: 2023-08-31. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Isotonic/open-instruct-v1

Description:

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 693502500.8465096
    num_examples: 399050
  - name: test
    num_bytes: 173376494.1534904
    num_examples: 99763
  download_size: 369952246
  dataset_size: 866878995.0
task_categories:
- text-generation
- conversational
language:
- en
size_categories:
- 100K<n<1M
---

# Dataset Card for "open-instruct-v1"

Open Instruct V1 is an amalgamation of different datasets which are cleaned and then collated into a singular format for training.

Uses Stability AI's System Prompt.

```
### System: StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
```

## Dataset Breakdown

| Dataset | Amount of Samples |
|----------------|-------------------|
| [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 51759 |
| [Self Instruct](https://github.com/yizhongw/self-instruct) | 82599 |
| [GPT-4 Instruct](https://github.com/teknium1/GPTeacher) | 18194 |
| [Code Alpaca](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) | 18019 |
| [Dolly](https://huggingface.co/datasets/HuggingFaceH4/databricks_dolly_15k) | 15015 |
| [Synthetic](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) | 33143 |
| [Roleplay](https://github.com/teknium1/GPTeacher) | 3146 |
| [asss](https://huggingface.co/datasets/HuggingFaceH4/asss) | 448 |
| [instruction-dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) | 327 |
| [Human assistant deduped](https://huggingface.co/datasets/Isotonic/human_assistant_conversation_deduped) | 209350 |
| Total | 432000 |
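Since the card is hosted on the Hugging Face Hub, one plausible way to fetch the data is the `datasets` library. The sketch below is an assumption about usage, not something the card itself documents: it assumes `datasets` is installed, that the Hub is reachable, and that the hosted files still match the column names in the metadata. The helper name `load_open_instruct` is made up for this example.

```python
# Repository id taken from the download link; column names from dataset_info.features.
DATASET_ID = "Isotonic/open-instruct-v1"
EXPECTED_COLUMNS = ("instruction", "input", "output", "text")


def load_open_instruct(split="train"):
    """Download (or reuse the local cache of) one split of the dataset.

    Hypothetical helper: assumes `pip install datasets` and network access.
    """
    # Imported lazily so this module can be imported without the library present.
    from datasets import load_dataset

    dataset = load_dataset(DATASET_ID, split=split)

    # Fail early if the hosted data no longer matches the card's schema.
    missing = [name for name in EXPECTED_COLUMNS if name not in dataset.column_names]
    if missing:
        raise ValueError(f"columns missing from {split!r} split: {missing}")
    return dataset
```

If the hosted files agree with the card, `load_open_instruct("test")` would be expected to return the 99763-example test split; that has not been verified here.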

Provider:

Isotonic

Summary of Original Information

Dataset Overview

Dataset Name

Name: open-instruct-v1

Dataset Features

Features:
- instruction: string
- input: string
- output: string
- text: string
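The four string features above can be turned into a small schema check. This is a minimal sketch, assuming records arrive as plain dictionaries; the function name `schema_errors` and the sample record are invented for illustration and are not taken from the dataset.

```python
from typing import Any, Mapping

# Feature names and types as listed above: all four columns are strings.
FEATURES = {"instruction": str, "input": str, "output": str, "text": str}


def schema_errors(record: Mapping[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    errors = []
    for name, expected_type in FEATURES.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"{name}: expected str, got {type(record[name]).__name__}")
    for name in record:
        if name not in FEATURES:
            errors.append(f"unexpected field: {name}")
    return errors


# Hypothetical record, made up for this example.
example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
    "text": "Translate to French.\nGood morning\nBonjour",
}

print(schema_errors(example))                # []
print(schema_errors({"instruction": "Hi"}))  # three "missing field" messages
```

How the `text` column is actually assembled from the other three is not stated in the card, so the sample value here is only a placeholder.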

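Dataset Splits

Training set:
- Number of examples: 399050
- Storage size: 693502500.8465096 bytes
Test set:
- Number of examples: 99763
- Storage size: 173376494.1534904 bytes

Dataset Size

Download size: 369952246 bytes
Total dataset size: 866878995.0 bytes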
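The split and size figures can be cross-checked with a little arithmetic. The sketch below only uses the numbers listed above; the variable names are arbitrary. It suggests a roughly 80/20 train/test split, and the fractional byte counts yield the same average size per example in both splits, which would be consistent with the sizes having been prorated by example count (an inference, not something the card states).

```python
# Figures copied from the split and size listing above.
SPLITS = {
    "train": {"num_examples": 399050, "num_bytes": 693502500.8465096},
    "test": {"num_examples": 99763, "num_bytes": 173376494.1534904},
}
DATASET_SIZE = 866878995.0   # bytes, as listed
DOWNLOAD_SIZE = 369952246    # bytes, as listed

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Share of examples that landed in the training split.
train_fraction = SPLITS["train"]["num_examples"] / total_examples

# Average stored bytes per example, computed separately for each split.
bytes_per_example = {
    name: s["num_bytes"] / s["num_examples"] for name, s in SPLITS.items()
}

# Ratio of the download size to the listed dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(total_examples)  # 498813
print(f"train fraction: {train_fraction:.4f}")
print(f"split bytes match dataset size: {abs(total_bytes - DATASET_SIZE) < 1e-3}")
print({name: round(value, 2) for name, value in bytes_per_example.items()})
print(f"download / dataset size: {download_ratio:.3f}")
```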

Task Categories

Categories:
- Text generation
- Conversational

Language

Language: English

Dataset Scale

Scale: 100K<n<1M

Dataset Sources

Sources:
- Alpaca: 51759 samples
- Self Instruct: 82599 samples
- GPT-4 Instruct: 18194 samples
- Code Alpaca: 18019 samples
- Dolly: 15015 samples
- Synthetic: 33143 samples
- Roleplay: 3146 samples
- asss: 448 samples
- instruction-dataset: 327 samples
- Human assistant deduped: 209350 samples
Total number of samples: 432000
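As a sanity check, the per-source counts can be summed and compared with the listed total. The sketch below uses only the figures from this page. It also compares the result with the 498813 examples obtained by adding the train and test split sizes (399050 + 99763); the two totals differ, and the card does not explain the gap.

```python
# Per-source sample counts as listed in the breakdown.
SOURCE_COUNTS = {
    "Alpaca": 51759,
    "Self Instruct": 82599,
    "GPT-4 Instruct": 18194,
    "Code Alpaca": 18019,
    "Dolly": 15015,
    "Synthetic": 33143,
    "Roleplay": 3146,
    "asss": 448,
    "instruction-dataset": 327,
    "Human assistant deduped": 209350,
}
LISTED_TOTAL = 432000
SPLIT_TOTAL = 399050 + 99763  # train + test examples from the split metadata

computed_total = sum(SOURCE_COUNTS.values())
print(computed_total == LISTED_TOTAL)  # True

# Each source's share of the breakdown, largest first.
for name, count in sorted(SOURCE_COUNTS.items(), key=lambda item: -item[1]):
    print(f"{name:<24} {count:>7} {count / computed_total:6.1%}")

# Difference between the split metadata and the breakdown total.
print(SPLIT_TOTAL - computed_total)  # 66813
```

Roughly half of the breakdown comes from the "Human assistant deduped" source, so any filtering or weighting applied to that source would be likely to dominate the mix.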
