VMware/open-instruct
Hugging Face, updated 2023-07-12, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/VMware/open-instruct
Resource description:
---
dataset_info:
  features:
  - name: alpaca_prompt
    dtype: string
  - name: response
    dtype: string
  - name: instruction
    dtype: string
  - name: source
    dtype: string
  - name: task_name
    dtype: string
  - name: template_type
    dtype: string
  splits:
  - name: train
    num_bytes: 125656035
    num_examples: 142622
  download_size: 57912402
  dataset_size: 125656035
license: cc-by-3.0
task_categories:
- text-generation
- conversational
- text2text-generation
language:
- en
pretty_name: T
size_categories:
- 100K<n<1M
---
# Dataset Card for "open-instruct"
This dataset is a combination of:
1. Filtered subset of [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
2. Train split of [Mosaic-dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf), which consists of [Databricks' dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
3. Filtered subset of [conceptofmind/cot_submix_original](https://huggingface.co/datasets/conceptofmind/cot_submix_original)
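To see how the three components are represented in the data, one option is to count rows per value of the `source` column. The sketch below assumes the Hugging Face `datasets` package is installed for the download step; the helper names and the sample `source` labels are illustrative and are not taken from the dataset itself.

```python
from collections import Counter
from typing import Iterable, Mapping


def load_open_instruct():
    """Download the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so count_by_source works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("VMware/open-instruct", split="train")


def count_by_source(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rows per value of the `source` column."""
    return Counter(row["source"] for row in rows)


# Hypothetical rows for illustration; real `source` labels may be spelled differently.
sample_rows = [
    {"source": "oasst1", "instruction": "Say hello.", "response": "Hello!"},
    {"source": "dolly_hhrlhf", "instruction": "Add 2 and 2.", "response": "4"},
    {"source": "oasst1", "instruction": "Name a color.", "response": "Blue"},
]

source_counts = count_by_source(sample_rows)
```

With network access, `count_by_source(load_open_instruct())` should give the same kind of breakdown for the full train split.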
## Dataset
The dataset consists of 6 columns:
1. instruction: The natural language instruction without any prompt template (extracted from the Alpaca-format prompts in Mosaic-dolly-hhrlhf)
2. alpaca_prompt: The instruction wrapped in the Alpaca prompt template
3. response: The response to the instruction
4. source: Dataset source
5. task_name
6. template_type: FLAN template used (zeroshot or fewshot)
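The relationship between `instruction` and `alpaca_prompt` can be sketched as follows. This assumes the widely used Stanford Alpaca "no input" template; the card does not state the exact wording used in this dataset, so compare the output against a real `alpaca_prompt` value before relying on it. The names `ALPACA_NO_INPUT_TEMPLATE` and `to_alpaca_prompt` are hypothetical.

```python
# Assumed template: the common Stanford Alpaca prompt for instructions without input.
# The dataset's own `alpaca_prompt` column may differ in wording or whitespace.
ALPACA_NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the assumed Alpaca prompt template."""
    return ALPACA_NO_INPUT_TEMPLATE.format(instruction=instruction)


example_prompt = to_alpaca_prompt("Name three primary colors.")
```

For supervised fine-tuning, a typical training string would be the `alpaca_prompt` value followed directly by the `response` value.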
## License
- It is usable for commercial purposes so long as you follow the terms of the license.
### Dataset subset licenses:
- Open-instruct-v1-dolly-hhrlhf-oasst1 (Mosaic/Dolly-HHRLHF + filtered OASST1) - CC BY 3.0
Subset of COT SUBMIX (FROM FLAN V2) Zeroshot examples:
- ESNLI - MIT
- ECQA - CDLA 1.0 - Sharing
- Strategy - MIT
- CREAK - MIT
- gsm8k - MIT
- aqua - MIT
- qasc - Apache 2.0
Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:
Wikipedia (various pages) - https://www.wikipedia.org/
- Copyright © Wikipedia editors and contributors.
Databricks (https://www.databricks.com)
- Copyright © Databricks
Mosaic ML (https://www.mosaicml.com/)
- Copyright © Mosaic ML
VMware
- Copyright © VMware
Provider:
VMware
Summary of original information
Dataset overview
Dataset name
- open-instruct
Dataset composition
- Filtered subset of OpenAssistant/oasst1
- Train split of Mosaic-dolly-hhrlhf, which contains the Databricks dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF
- Filtered subset of conceptofmind/cot_submix_original
Dataset features
- instruction: natural language instruction without any prompt template
- alpaca_prompt: Alpaca prompt template version of the instruction
- response: the response to the instruction
- source: dataset source
- task_name
- template_type: FLAN template used (zeroshot or fewshot)
Dataset size
- Train split size: 125656035 bytes
- Train split examples: 142622
- Download size: 57912402 bytes
License
- Main license: cc-by-3.0
- Subset licenses:
  - Open-instruct-v1-dolly-hhrlhf-oasst1: CC BY 3.0
  - ESNLI: MIT
  - ECQA: CDLA 1.0 - Sharing
  - Strategy: MIT
  - CREAK: MIT
  - gsm8k: MIT
  - aqua: MIT
  - qasc: Apache 2.0
Task categories
- text-generation
- conversational
- text2text-generation
Language
- en
Size category
- 100K<n<1M



