
VMware/open-instruct

Source: Hugging Face · Updated 2023-07-12 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/VMware/open-instruct
Resource description:
---
dataset_info:
  features:
  - name: alpaca_prompt
    dtype: string
  - name: response
    dtype: string
  - name: instruction
    dtype: string
  - name: source
    dtype: string
  - name: task_name
    dtype: string
  - name: template_type
    dtype: string
  splits:
  - name: train
    num_bytes: 125656035
    num_examples: 142622
  download_size: 57912402
  dataset_size: 125656035
license: cc-by-3.0
task_categories:
- text-generation
- conversational
- text2text-generation
language:
- en
pretty_name: T
size_categories:
- 100K<n<1M
---

# Dataset Card for "open-instruct"

This dataset is a combination of:

1. Filtered subset of [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
2. train split of [Mosaic-dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) (consists of [Databricks' dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)).
3. Filtered subset of [conceptofmind/cot_submix_original](https://huggingface.co/datasets/conceptofmind/cot_submix_original)

## Dataset

The dataset consists of 6 columns:

1. instruction: The natural language instruction without any prompt templates (we extracted them out of the alpaca-format in Mosaic-dolly-hhrlhf)
2. alpaca_prompt: Alpaca prompt template versions of instruction
3. response: The response to the instruction
4. source: Dataset source
5. task_name
6. template_type: flan template used (zeroshot or fewshot)

## License

- It is usable for commercial purposes so long as you follow the terms of the license.

### Dataset subset licenses:

- Open-instruct-v1-dolly-hhrlhf-oasst1 (Mosaic/Dolly-HHRLHF + filtered OASST1) - cc by 3.0

Subset of COT SUBMIX (FROM FLAN V2) Zeroshot examples:

- ESNLI - MIT
- ECQA - CDLA 1.0 - Sharing
- Strategy - MIT
- CREAK - MIT
- gsmk8 - MIT
- aqua - MIT
- qasc - Apache 2.0

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

- Wikipedia (various pages) - https://www.wikipedia.org/ - Copyright © Wikipedia editors and contributors.
- Databricks (https://www.databricks.com) - Copyright © Databricks
- Mosaic ML (https://www.mosaicml.com/) - Copyright © Mosaic ML
- VMware - Copyright © VMware

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
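The six-column, all-string schema declared in the card's metadata can be checked against individual records. The following is a minimal sketch, not part of the original card: `check_record` and `load_train_split` are hypothetical helper names, the sample record is made up for illustration, and the loader assumes the `datasets` package is installed and the Hugging Face Hub is reachable. Null values, if the dataset contains any, would be reported here as non-strings.

```python
from typing import Mapping

# Column names as listed in the dataset card metadata; each has dtype string.
EXPECTED_COLUMNS = (
    "alpaca_prompt",
    "response",
    "instruction",
    "source",
    "task_name",
    "template_type",
)


def check_record(record: Mapping[str, object]) -> list[str]:
    """Return the problems found in one record (empty list if it matches the card)."""
    problems = []
    for name in EXPECTED_COLUMNS:
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"column {name} is not a string")
    # Report anything the card does not mention.
    for name in sorted(set(record) - set(EXPECTED_COLUMNS)):
        problems.append(f"unexpected column: {name}")
    return problems


def load_train_split():
    """Hypothetical helper: fetch the train split from the Hugging Face Hub."""
    # Imported lazily so the schema check above also works without `datasets`.
    from datasets import load_dataset

    return load_dataset("VMware/open-instruct", split="train")


# Made-up record used only to exercise the check; it is not taken from the dataset.
sample = {
    "alpaca_prompt": "### Instruction:\nSay hello.\n\n### Response:\n",
    "response": "Hello!",
    "instruction": "Say hello.",
    "source": "example",
    "task_name": "",
    "template_type": "",
}
print(check_record(sample))
```

Calling `check_record(row)` on each row returned by `load_train_split()` would flag any drift between the published data and the card.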
Provider:
VMware
Summary of original information

Dataset overview

Dataset name

  • open-instruct

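Dataset composition

Dataset features

  • instruction: the natural language instruction, without any prompt template
  • alpaca_prompt: the Alpaca prompt template version of the instruction
  • response: the response to the instruction
  • source: the dataset source
  • task_name
  • template_type: the flan template used (zeroshot or fewshot)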
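The relationship between the instruction and alpaca_prompt columns can be illustrated with a small round trip. This is a sketch under a stated assumption: it uses the widely circulated Alpaca no-input template, and the exact wording or whitespace stored in the dataset's alpaca_prompt column may differ. The function names are hypothetical.

```python
# Assumption: the common Alpaca "no input" template. The dataset's own
# alpaca_prompt column may use different wording or whitespace.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

INSTRUCTION_MARKER = "### Instruction:\n"
RESPONSE_MARKER = "\n\n### Response:"


def to_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the assumed Alpaca template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


def extract_instruction(alpaca_prompt: str) -> str:
    """Recover the bare instruction from a prompt in the assumed format."""
    start = alpaca_prompt.index(INSTRUCTION_MARKER) + len(INSTRUCTION_MARKER)
    # Use the last response marker in case the instruction itself mentions one.
    end = alpaca_prompt.rindex(RESPONSE_MARKER)
    return alpaca_prompt[start:end]


prompt = to_alpaca_prompt("Name three primary colors.")
print(prompt)
print(extract_instruction(prompt))
```

`extract_instruction` mirrors, in simplified form, the kind of extraction the card mentions for the Mosaic-dolly-hhrlhf records; the actual procedure used by the dataset authors is not documented here.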

Dataset size

  • Train split size: 125656035 bytes
  • Train split examples: 142622
  • Download size: 57912402 bytes
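The size figures above imply a few derived quantities. The sketch below only does arithmetic on the three numbers listed; it suggests roughly 881 bytes per example on average and a download that is about 46% of the dataset size.

```python
# Figures copied from the size listing above.
NUM_BYTES = 125_656_035      # train split size in bytes
NUM_EXAMPLES = 142_622       # train split example count
DOWNLOAD_SIZE = 57_912_402   # download size in bytes

avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"download size / dataset size: {download_ratio:.2f}")
print(f"dataset: {NUM_BYTES / 1e6:.1f} MB, download: {DOWNLOAD_SIZE / 1e6:.1f} MB")

# The example count is consistent with the declared size category 100K<n<1M.
in_declared_category = 100_000 < NUM_EXAMPLES < 1_000_000
print(f"within 100K<n<1M: {in_declared_category}")
```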

License

  • Main license: cc-by-3.0
  • Subset licenses:
    • Open-instruct-v1-dolly-hhrlhf-oasst1: cc by 3.0
    • ESNLI: MIT
    • ECQA: CDLA 1.0 - Sharing
    • Strategy: MIT
    • CREAK: MIT
    • gsmk8: MIT
    • aqua: MIT
    • qasc: Apache 2.0
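For filtering by license, the subset list above can be kept as a small lookup table. This is a sketch only: the keys are the labels exactly as written in the listing, and whether they match the values stored in the dataset's source or task_name columns is an unverified assumption. The helper names are hypothetical.

```python
# Subset labels and licenses exactly as given in the listing above.
SUBSET_LICENSES = {
    "Open-instruct-v1-dolly-hhrlhf-oasst1": "cc by 3.0",
    "ESNLI": "MIT",
    "ECQA": "CDLA 1.0 - Sharing",
    "Strategy": "MIT",
    "CREAK": "MIT",
    "gsmk8": "MIT",
    "aqua": "MIT",
    "qasc": "Apache 2.0",
}

# Case-insensitive index, since label casing in the data is not confirmed.
_BY_FOLDED_LABEL = {label.casefold(): lic for label, lic in SUBSET_LICENSES.items()}


def license_for(label: str):
    """Return the listed license for a subset label, or None if it is not listed."""
    return _BY_FOLDED_LABEL.get(label.casefold())


def subsets_with_license(license_name: str) -> list[str]:
    """Return the subset labels whose listed license matches, in listing order."""
    wanted = license_name.casefold()
    return [
        label
        for label, lic in SUBSET_LICENSES.items()
        if lic.casefold() == wanted
    ]


print(license_for("ecqa"))
print(subsets_with_license("MIT"))
```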

Task categories

  • text-generation
  • conversational
  • text2text-generation

Language

  • en

Size category

  • 100K<n<1M