Isotonic/open-instruct-v1

Source: Hugging Face. Updated: 2023-08-31. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Isotonic/open-instruct-v1
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 693502500.8465096
    num_examples: 399050
  - name: test
    num_bytes: 173376494.1534904
    num_examples: 99763
  download_size: 369952246
  dataset_size: 866878995.0
task_categories:
- text-generation
- conversational
language:
- en
size_categories:
- 100K<n<1M
---

# Dataset Card for "open-instruct-v1"

Open Instruct V1 is an amalgamation of different datasets which are cleaned and then collated into a singular format for training. Uses Stability AI's System Prompt.

```
### System: StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
```

## Dataset Breakdown

| Dataset | Amount of Samples |
|----------------|-------------------|
| [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 51759 |
| [Self Instruct](https://github.com/yizhongw/self-instruct) | 82599 |
| [GPT-4 Instruct](https://github.com/teknium1/GPTeacher) | 18194 |
| [Code Alpaca](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) | 18019 |
| [Dolly](https://huggingface.co/datasets/HuggingFaceH4/databricks_dolly_15k) | 15015 |
| [Synthetic](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) | 33143 |
| [Roleplay](https://github.com/teknium1/GPTeacher) | 3146 |
| [asss](https://huggingface.co/datasets/HuggingFaceH4/asss) | 448 |
| [instruction-dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) | 327 |
| [Human assistant deduped](https://huggingface.co/datasets/Isotonic/human_assistant_conversation_deduped) | 209350 |
| Total | 432000 |
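The card metadata lists four string columns and two splits. A minimal sketch of loading the dataset and checking a record against that schema might look like the following. It assumes the Hugging Face `datasets` library is installed and that the repository still resolves under the ID shown in the title; `missing_columns` and `load_splits` are hypothetical helper names, not part of the dataset or the library.

```python
# Repository ID as given on this page; availability is an assumption.
DATASET_ID = "Isotonic/open-instruct-v1"

# Column names taken from the `features` list in the card metadata.
EXPECTED_COLUMNS = ("instruction", "input", "output", "text")


def missing_columns(record: dict) -> list[str]:
    """Return the expected columns that are absent from a record."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


def load_splits():
    """Download the dataset and return its splits (hypothetical helper).

    The import is deferred so that `missing_columns` can be used
    without the `datasets` library installed.
    """
    from datasets import load_dataset

    return load_dataset(DATASET_ID)
```

Calling `load_splits()` needs network access and, going by the download size in the metadata, fetches roughly 370 MB. Running `missing_columns` on the first record of a split is a quick way to confirm the schema matches the card.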
Provider:
Isotonic
Summary of original information

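Dataset overview

Dataset name

  • Name: open-instruct-v1

Dataset features

  • Features:
    • instruction: string
    • input: string
    • output: string
    • text: string

Dataset splits

  • Training set:
    • Number of examples: 399050
    • Storage size: 693502500.8465096 bytes
  • Test set:
    • Number of examples: 99763
    • Storage size: 173376494.1534904 bytes

Dataset size

  • Download size: 369952246 bytes
  • Total dataset size: 866878995.0 bytes

Task categories

  • Categories:
    • Text generation
    • Conversational

Language

  • Language: English

Dataset scale

  • Scale: 100K<n<1M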
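The split and size figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed on this page; the variable names are illustrative. It suggests an approximately 80/20 train/test split, and the fractional byte counts look as if the total size was apportioned by example count, though the card does not say so.

```python
import math

# Figures copied from the split and size listings above.
TRAIN_EXAMPLES = 399050
TEST_EXAMPLES = 99763
TRAIN_BYTES = 693502500.8465096
TEST_BYTES = 173376494.1534904
DATASET_BYTES = 866878995.0
DOWNLOAD_BYTES = 369952246

total_examples = TRAIN_EXAMPLES + TEST_EXAMPLES
train_fraction = TRAIN_EXAMPLES / total_examples
test_fraction = TEST_EXAMPLES / total_examples

# The two split sizes should add up to the reported dataset size.
split_bytes = TRAIN_BYTES + TEST_BYTES

# Download size relative to the full dataset size.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"total examples: {total_examples}")
print(f"train fraction: {train_fraction:.4f}")
print(f"test fraction:  {test_fraction:.4f}")
print(f"split sizes sum to dataset size: {math.isclose(split_bytes, DATASET_BYTES)}")
print(f"download / dataset size: {download_ratio:.3f}")
```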
Dataset sources

  • Sources:
    • Alpaca: 51759 samples
    • Self Instruct: 82599 samples
    • GPT-4 Instruct: 18194 samples
    • Code Alpaca: 18019 samples
    • Dolly: 15015 samples
    • Synthetic: 33143 samples
    • Roleplay: 3146 samples
    • asss: 448 samples
    • instruction-dataset: 327 samples
    • Human assistant deduped: 209350 samples
  • Total number of samples: 432000
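As a sanity check on the breakdown, the per-source counts can be summed and compared with the stated total and with the split sizes. The sketch below uses only figures from this page; the dictionary and variable names are illustrative. The source counts do add up to the stated 432000, but the train and test splits together contain more examples than that, and the card does not explain the difference.

```python
# Per-source sample counts as listed in the dataset breakdown.
SOURCE_COUNTS = {
    "Alpaca": 51759,
    "Self Instruct": 82599,
    "GPT-4 Instruct": 18194,
    "Code Alpaca": 18019,
    "Dolly": 15015,
    "Synthetic": 33143,
    "Roleplay": 3146,
    "asss": 448,
    "instruction-dataset": 327,
    "Human assistant deduped": 209350,
}
STATED_TOTAL = 432000

# Train and test example counts from the split metadata.
SPLIT_TOTAL = 399050 + 99763

source_sum = sum(SOURCE_COUNTS.values())
difference = SPLIT_TOTAL - source_sum

print(f"sum of sources: {source_sum}")
print(f"matches stated total: {source_sum == STATED_TOTAL}")
print(f"examples across splits: {SPLIT_TOTAL}")
print(f"unexplained difference: {difference}")

# Share of each source in the stated total, largest first.
for name, count in sorted(SOURCE_COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{name:>24}: {count / STATED_TOTAL:.1%}")
```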