betteruncensored/VMware-open-instruct
Hugging Face | Updated 2024-04-01 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/betteruncensored/VMware-open-instruct
Resource description:
---
dataset_info:
  features:
  - name: alpaca_prompt
    dtype: string
  - name: response
    dtype: string
  - name: instruction
    dtype: string
  - name: source
    dtype: string
  - name: task_name
    dtype: string
  - name: template_type
    dtype: string
  splits:
  - name: train
    num_bytes: 125656035
    num_examples: 142622
  download_size: 57912402
  dataset_size: 125656035
license: cc-by-3.0
task_categories:
- text-generation
- text2text-generation
language:
- en
pretty_name: T
size_categories:
- 100K<n<1M
---
# Dataset Card for "open-instruct" Better Uncensored
This is the VMware/open-instruct dataset processed with the Better Uncensored pipeline.
Slightly more than 4,000 records were censored, which is less than 5% of the dataset. The format is kept the same for compatibility.
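
As a rough check of the "less than 5%" figure, the censored share can be estimated from the split size in the card metadata. This is only a sketch: it assumes exactly 4,000 censored records as a lower bound, since the card does not give the exact count.

```python
# Split size taken from the card metadata (num_examples of the train split).
NUM_EXAMPLES = 142_622
# Assumption: a lower bound, since the card only says slightly more than 4,000.
CENSORED_LOWER_BOUND = 4_000


def censored_fraction(censored: int, total: int) -> float:
    """Return the share of records that were censored."""
    if total <= 0:
        raise ValueError("total must be positive")
    return censored / total


fraction = censored_fraction(CENSORED_LOWER_BOUND, NUM_EXAMPLES)
print(f"At least {fraction:.1%} of records were censored")
```

Under that assumption the share comes to roughly 2.8%, comfortably below the 5% stated above.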
# Dataset Card for "open-instruct"
This dataset is a combination of:
1. Filtered subset of [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
2. Train split of [Mosaic-dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) (consists of the [Databricks dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)).
3. Filtered subset of [conceptofmind/cot_submix_original](https://huggingface.co/datasets/conceptofmind/cot_submix_original)
## Dataset
The dataset consists of 6 columns:
1. instruction: The natural language instruction without any prompt templates (we extracted them out of the alpaca-format in Mosaic-dolly-hhrlhf)
2. alpaca_prompt: Alpaca prompt template versions of instruction
3. response: The response to the instruction
4. source: Dataset source
5. task_name
6. template_type: flan template used (zeroshot or fewshot)
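
The column layout above can be sketched as a typed record, together with the relationship between `instruction` and `alpaca_prompt`. This is a minimal illustration, not the pipeline used to build the dataset: the template text is the commonly used no-input Alpaca template and is an assumption here, the helper names `to_alpaca_prompt` and `extract_instruction` are hypothetical, and the example row is invented.

```python
from typing import TypedDict


class OpenInstructRecord(TypedDict):
    """One row, with the six string columns listed above."""
    instruction: str
    alpaca_prompt: str
    response: str
    source: str
    task_name: str
    template_type: str


# Assumption: the widely used no-input Alpaca template. The exact wording
# in this dataset may differ; inspect a few rows before relying on it.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the (assumed) Alpaca template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)


def extract_instruction(alpaca_prompt: str) -> str:
    """Recover the bare instruction from a prompt built with the template."""
    start_marker = "### Instruction:\n"
    end_marker = "\n\n### Response:"
    start = alpaca_prompt.index(start_marker) + len(start_marker)
    end = alpaca_prompt.rindex(end_marker)
    return alpaca_prompt[start:end]


# Hypothetical row: every value below is invented for illustration.
example: OpenInstructRecord = {
    "instruction": "Name three primary colors.",
    "alpaca_prompt": to_alpaca_prompt("Name three primary colors."),
    "response": "Red, yellow, and blue.",
    "source": "hypothetical-source",
    "task_name": "hypothetical-task",
    "template_type": "zeroshot",
}

print(extract_instruction(example["alpaca_prompt"]))
```

With the Hugging Face `datasets` library, real rows of this shape would typically be obtained with `load_dataset("betteruncensored/VMware-open-instruct", split="train")`, which needs network access and is therefore left out of the sketch.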
## License
- It is usable for commercial purposes so long as you follow the terms of the license.
### Dataset subset licenses:
- Open-instruct-v1-dolly-hhrlhf-oasst1 (Mosaic/Dolly-HHRLHF + filtered OASST1) - CC BY 3.0
Subset of COT SUBMIX (FROM FLAN V2) Zeroshot examples:
- ESNLI - MIT
- ECQA - CDLA 1.0 - Sharing
- Strategy - MIT
- CREAK - MIT
- gsm8k - MIT
- aqua - MIT
- qasc - Apache 2.0
Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:
Wikipedia (various pages) - https://www.wikipedia.org/
- Copyright © Wikipedia editors and contributors.
Databricks (https://www.databricks.com)
- Copyright © Databricks
Mosaic ML (https://www.mosaicml.com/)
- Copyright © Mosaic ML
VMware
- Copyright © VMware
Provider:
betteruncensored
Summary of original information
Dataset overview
Dataset name
- Name: open-instruct
Dataset features
- Feature columns:
  - alpaca_prompt: string
  - response: string
  - instruction: string
  - source: string
  - task_name: string
  - template_type: string
Dataset splits
- Train split:
  - Size: 125656035 bytes
  - Number of examples: 142622
Dataset size
- Download size: 57912402 bytes
- Total dataset size: 125656035 bytes
License
- License type: cc-by-3.0
Task categories
- text-generation
- text2text-generation
Language
- en
Size category
- 100K<n<1M
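Dataset composition
- A combination of:
  - Filtered subset of OpenAssistant/oasst1
  - Train split of Mosaic-dolly-hhrlhf (consists of the Databricks dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF)
  - Filtered subset of conceptofmind/cot_submix_original
Column descriptions
- Columns:
  - instruction: The natural language instruction without any prompt templates
  - alpaca_prompt: Alpaca prompt template version of the instruction
  - response: The response to the instruction
  - source: Dataset source
  - task_name
  - template_type: FLAN template type used (zeroshot or fewshot)
License details
- Main license: cc-by-3.0
- Subset licenses:
  - Open-instruct-v1-dolly-hhrlhf-oasst1: CC BY 3.0
  - Other subsets: include MIT, CDLA 1.0 - Sharing, Apache 2.0, and others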



