
betteruncensored/VMware-open-instruct

Source: Hugging Face. Updated: 2024-04-01. Indexed: 2024-06-11.
Download link:
https://hf-mirror.com/datasets/betteruncensored/VMware-open-instruct
Resource description:
---
dataset_info:
  features:
  - name: alpaca_prompt
    dtype: string
  - name: response
    dtype: string
  - name: instruction
    dtype: string
  - name: source
    dtype: string
  - name: task_name
    dtype: string
  - name: template_type
    dtype: string
  splits:
  - name: train
    num_bytes: 125656035
    num_examples: 142622
  download_size: 57912402
  dataset_size: 125656035
license: cc-by-3.0
task_categories:
- text-generation
- text2text-generation
language:
- en
pretty_name: T
size_categories:
- 100K<n<1M
---

# Dataset Card for "open-instruct" Better Uncensored

This is the VMware/open-instruct dataset processed with the Better Uncensored pipeline. A bit more than 4000 records were censored, less than 5%. The format is kept the same for compatibility.

# Dataset Card for "open-instruct"

This dataset is a combination of:

1. Filtered subset of [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
2. Train split of [Mosaic-dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) (consists of [Databricks' dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)).
3. Filtered subset of [conceptofmind/cot_submix_original](https://huggingface.co/datasets/conceptofmind/cot_submix_original)

## Dataset

The dataset consists of 6 columns:

1. instruction: The natural language instruction without any prompt templates (we extracted them out of the alpaca-format in Mosaic-dolly-hhrlhf)
2. alpaca_prompt: Alpaca prompt template versions of instruction
3. response: The response to the instruction
4. source: Dataset source
5. task_name
6. template_type: flan template used (zeroshot or fewshot)

## License

- It is usable for commercial purposes so long as you follow the terms of the license.

### Dataset subset licenses:

- Open-instruct-v1-dolly-hhrlhf-oasst1 (Mosaic/Dolly-HHRLHF + filtered OASST1) - cc by 3.0

Subset of COT SUBMIX (FROM FLAN V2) Zeroshot examples:

- ESNLI - MIT
- ECQA - CDLA 1.0 - Sharing
- Strategy - MIT
- CREAK - MIT
- gsm8k - MIT
- aqua - MIT
- qasc - Apache 2.0

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

- Wikipedia (various pages) - https://www.wikipedia.org/ - Copyright © Wikipedia editors and contributors.
- Databricks (https://www.databricks.com) - Copyright © Databricks
- Mosaic ML (https://www.mosaicml.com/) - Copyright © Mosaic ML
- VMware - Copyright © VMware
Provider:
betteruncensored
Original information summary

Dataset overview

Dataset name

  • Name: open-instruct

Dataset features

  • Feature columns:
    • alpaca_prompt: string
    • response: string
    • instruction: string
    • source: string
    • task_name: string
    • template_type: string

Dataset splits

  • Train split:
    • Size: 125656035 bytes
    • Number of examples: 142622

Dataset size

  • Download size: 57912402 bytes
  • Total dataset size: 125656035 bytes
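
To put the raw byte counts above in perspective, the following minimal sketch derives a few rough figures from them: the average record size, the ratio of download size to dataset size, and the dataset size in MiB. The three constants are copied from the metadata; the function name `size_summary` and the dictionary keys are illustrative assumptions, not part of the dataset card.

```python
# Figures copied from the dataset metadata listed above.
NUM_BYTES = 125_656_035     # size of the train split, in bytes
NUM_EXAMPLES = 142_622      # number of examples in the train split
DOWNLOAD_SIZE = 57_912_402  # size of the download, in bytes


def size_summary(num_bytes: int, num_examples: int, download_size: int) -> dict:
    """Derive rough per-record and ratio figures (hypothetical helper)."""
    return {
        "avg_bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_size / num_bytes,
        "dataset_mib": num_bytes / 2**20,
        # The card tags the dataset as 100K<n<1M; check the count agrees.
        "in_size_category": 100_000 < num_examples < 1_000_000,
    }


summary = size_summary(NUM_BYTES, NUM_EXAMPLES, DOWNLOAD_SIZE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

On these numbers the download is a bit under half the size of the dataset, and the example count is consistent with the `100K<n<1M` size category.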

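License

  • License type: cc-by-3.0

Task categories

  • text-generation
  • text2text-generation

Language

  • en

Size category

  • 100K<n<1M

Dataset composition

Dataset column descriptions

  • Column descriptions:
    1. instruction: the natural language instruction, without any prompt template
    2. alpaca_prompt: the Alpaca prompt template version of the instruction
    3. response: the response to the instruction
    4. source: the dataset source
    5. task_name
    6. template_type: the flan template type used (zeroshot or fewshot)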
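
As a rough illustration of how the `instruction` and `alpaca_prompt` columns relate, the sketch below wraps a bare instruction in the widely used Alpaca "no input" template and checks that a record carries all six columns. The template text is an assumption: the card does not quote the exact string used in this dataset, so the real `alpaca_prompt` values may differ. The helper names and the sample record are hypothetical.

```python
# The six columns named in the column descriptions above.
EXPECTED_COLUMNS = (
    "instruction", "alpaca_prompt", "response",
    "source", "task_name", "template_type",
)

# Assumption: the common Alpaca "no input" template. The dataset card
# does not state the exact template text it uses.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the assumed Alpaca template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)


def missing_columns(record: dict) -> list:
    """Return the expected column names that are absent from a record."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


# Hypothetical record, made up for illustration (not taken from the dataset).
example = {
    "instruction": "Name three primary colors.",
    "response": "Red, yellow, and blue.",
    "source": "example",
    "task_name": "example",
    "template_type": "zeroshot",
}
example["alpaca_prompt"] = build_alpaca_prompt(example["instruction"])

print(missing_columns(example))
print(example["alpaca_prompt"])
```

If the Hugging Face `datasets` library is installed, `load_dataset("betteruncensored/VMware-open-instruct", split="train")` should yield records that can be passed to `missing_columns` in the same way, assuming the repository is still available under that name.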

License details

  • Primary license: cc-by-3.0
  • Subset licenses:
    • Open-instruct-v1-dolly-hhrlhf-oasst1: cc by 3.0
    • Other subsets: include MIT, CDLA 1.0 - Sharing, Apache 2.0, and others