betteruncensored/VMware-open-instruct

Name: betteruncensored/VMware-open-instruct
Creator: betteruncensored
Published: 2024-04-01 22:49:37
License: no description provided

Hugging Face | Updated 2024-04-01 | Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/betteruncensored/VMware-open-instruct

Resource description:

---
dataset_info:
  features:
  - name: alpaca_prompt
    dtype: string
  - name: response
    dtype: string
  - name: instruction
    dtype: string
  - name: source
    dtype: string
  - name: task_name
    dtype: string
  - name: template_type
    dtype: string
  splits:
  - name: train
    num_bytes: 125656035
    num_examples: 142622
  download_size: 57912402
  dataset_size: 125656035
license: cc-by-3.0
task_categories:
- text-generation
- text2text-generation
language:
- en
pretty_name: T
size_categories:
- 100K<n<1M
---

# Dataset Card for "open-instruct" Better Uncensored

This is the VMware/open-instruct dataset processed with the Better Uncensored pipeline. A bit more than 4,000 records were censored, less than 5% of the total. The format is kept the same for compatibility.

# Dataset Card for "open-instruct"

This dataset is a combination of:

1. Filtered subset of [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
2. Train split of [Mosaic-dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) (consists of [Databricks' dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf))
3. Filtered subset of [conceptofmind/cot_submix_original](https://huggingface.co/datasets/conceptofmind/cot_submix_original)

## Dataset

The dataset consists of 6 columns:

1. instruction: The natural language instruction without any prompt templates (extracted from the alpaca-format prompts in Mosaic-dolly-hhrlhf)
2. alpaca_prompt: Alpaca prompt template version of the instruction
3. response: The response to the instruction
4. source: Dataset source
5. task_name
6. template_type: flan template used (zeroshot or fewshot)

## License

- It is usable for commercial purposes so long as you follow the terms of the license.

### Dataset subset licenses:

- Open-instruct-v1-dolly-hhrlhf-oasst1 (Mosaic/Dolly-HHRLHF + filtered OASST1) - cc by 3.0

Subset of COT SUBMIX (FROM FLAN V2) Zeroshot examples:

- ESNLI - MIT
- ECQA - CDLA 1.0 - Sharing
- Strategy - MIT
- CREAK - MIT
- gsm8k - MIT
- aqua - MIT
- qasc - Apache 2.0

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/
- Copyright © Wikipedia editors and contributors.

Databricks (https://www.databricks.com)
- Copyright © Databricks

Mosaic ML (https://www.mosaicml.com/)
- Copyright © Mosaic ML

VMware
- Copyright © VMware

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

Provider:

betteruncensored

Summary of original information

Dataset overview

Dataset name

Name: open-instruct

Dataset features

Feature columns:
- alpaca_prompt: string
- response: string
- instruction: string
- source: string
- task_name: string
- template_type: string
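The six string columns listed above can be checked with a small typed record. This is a minimal sketch, not part of the dataset's tooling: the class name `OpenInstructRecord`, the helper `parse_record`, and the example row are hypothetical, and the check assumes every column holds a string (rows that store empty columns as null would need a looser rule).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OpenInstructRecord:
    """One row of the train split; all six columns are strings per the card."""
    alpaca_prompt: str
    response: str
    instruction: str
    source: str
    task_name: str
    template_type: str


def parse_record(row: dict) -> OpenInstructRecord:
    """Build a record from a dict, rejecting missing, extra, or non-string columns."""
    expected = [f.name for f in fields(OpenInstructRecord)]
    missing = [name for name in expected if name not in row]
    extra = [name for name in row if name not in expected]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unexpected columns: {extra}")
    for name in expected:
        if not isinstance(row[name], str):
            raise TypeError(f"column {name!r} must be a string")
    return OpenInstructRecord(**row)


# Made-up example row for illustration; not taken from the dataset.
example_row = {
    "alpaca_prompt": "### Instruction:\nName a primary color.\n\n### Response:\n",
    "response": "Red.",
    "instruction": "Name a primary color.",
    "source": "example_source",
    "task_name": "",
    "template_type": "",
}
record = parse_record(example_row)
print(record.instruction)
```

A frozen dataclass keeps rows immutable once validated, which suits read-only inspection of a published dataset.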

Dataset splits

Train split:
- Size: 125656035 bytes
- Number of examples: 142622

Dataset size

Download size: 57912402 bytes
Total dataset size: 125656035 bytes
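A few derived figures follow from the sizes above. The sketch below uses only the numbers stated on this page; the one assumption is `CENSORED_RECORDS = 4000`, which stands in for the card's "a bit more than 4000" and so gives only a rough lower bound on the censored share.

```python
# Figures copied from the dataset metadata on this page.
NUM_BYTES = 125_656_035       # size of the train split in bytes
NUM_EXAMPLES = 142_622        # number of examples in the train split
DOWNLOAD_SIZE = 57_912_402    # download size in bytes

# Assumption: the card says "a bit more than 4000" records were censored;
# 4000 is used here as an approximate lower bound.
CENSORED_RECORDS = 4000

avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE / NUM_BYTES
censored_share = CENSORED_RECORDS / NUM_EXAMPLES

print(f"average example size: {avg_bytes_per_example:.0f} bytes")
print(f"download size relative to dataset size: {download_ratio:.1%}")
print(f"approximate censored share: {censored_share:.1%}")
```

The censored share comes out well under the 5% ceiling quoted in the card, so the two statements are consistent with each other.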

License

License type: cc-by-3.0

Task categories

text-generation
text2text-generation

Language

en

Size category

100K<n<1M

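Dataset composition

A combination of the following:
1. Filtered subset of OpenAssistant/oasst1
2. Train split of Mosaic-dolly-hhrlhf (contains the Databricks dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF)
3. Filtered subset of conceptofmind/cot_submix_original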

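Dataset column descriptions

Column descriptions:
1. instruction: natural language instruction, without any prompt template
2. alpaca_prompt: Alpaca prompt template version of the instruction
3. response: the response to the instruction
4. source: dataset source
5. task_name
6. template_type: flan template type used (zeroshot or fewshot)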
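The relationship between `instruction` and `alpaca_prompt` can be sketched as a wrap and unwrap pair. The page does not show the exact template string stored in `alpaca_prompt`, so the code below assumes the widely used Stanford Alpaca no-input template; `ALPACA_TEMPLATE`, `to_alpaca_prompt`, and `extract_instruction` are hypothetical names, and the template should be verified against real rows before relying on it.

```python
# Assumption: the standard Alpaca "no input" template. The dataset's actual
# alpaca_prompt strings may differ in wording or whitespace.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the assumed Alpaca template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)


def extract_instruction(alpaca_prompt: str) -> str:
    """Recover the bare instruction from a prompt that follows the assumed template."""
    prefix, suffix = ALPACA_TEMPLATE.split("{instruction}")
    well_formed = (
        len(alpaca_prompt) >= len(prefix) + len(suffix)
        and alpaca_prompt.startswith(prefix)
        and alpaca_prompt.endswith(suffix)
    )
    if not well_formed:
        raise ValueError("prompt does not follow the assumed Alpaca template")
    return alpaca_prompt[len(prefix):len(alpaca_prompt) - len(suffix)]


prompt = to_alpaca_prompt("List three uses of a paperclip.")
recovered = extract_instruction(prompt)
print(recovered)
```

Raising on a mismatch, rather than returning the input unchanged, makes it obvious when a row uses a different template than the one assumed here.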

License details

Primary license: cc-by-3.0
Subset licenses:
- Open-instruct-v1-dolly-hhrlhf-oasst1: cc by 3.0
- Other subsets: include MIT, CDLA 1.0 - Sharing, Apache 2.0, among others
