mosaicml/dolly_hhrlhf
Source: Hugging Face. Updated 2023-10-02; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mosaicml/dolly_hhrlhf
Resource description:
---
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 43781455.002688624
    num_examples: 59310
  - name: test
    num_bytes: 4479286.805304853
    num_examples: 5129
  download_size: 24882010
  dataset_size: 48260741.80799348
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- en
pretty_name: Dolly HH-RLHF
---
# Dataset Card for "dolly_hhrlhf"
This dataset combines [Databricks' dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset with a filtered subset of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf). It also includes a test split, which the original `dolly` set lacks. The test set consists of 200 randomly selected samples from `dolly` plus the 4,929 samples from the HH-RLHF test set that made it through the filtering process, for 5,129 in total. The train set contains 59,310 samples: `15,014 - 200 = 14,814` from Dolly and the remaining 44,496 from HH-RLHF.
The combined dataset is slightly larger than Alpaca and, in our experience, of slightly higher quality. It is also usable for commercial purposes so long as you follow the terms of the license.
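As a quick sanity check, the split sizes can be recomputed from the per-source counts given in this card. The sketch below uses only those stated numbers; the constant and function names are illustrative and not part of any dataset tooling.

```python
# Per-source counts exactly as stated in this dataset card.
DOLLY_TOTAL = 15_014  # Dolly samples before the test hold-out
DOLLY_TEST = 200      # Dolly samples randomly moved to the test split
HH_TRAIN = 44_496     # filtered HH-RLHF samples in the train split
HH_TEST = 4_929       # filtered HH-RLHF samples in the test split


def split_sizes():
    """Return the (train, test) sizes implied by the per-source counts."""
    dolly_train = DOLLY_TOTAL - DOLLY_TEST
    train = dolly_train + HH_TRAIN
    test = DOLLY_TEST + HH_TEST
    return train, test


train_size, test_size = split_sizes()
print(train_size, test_size)  # 59310 5129
```

Both values agree with the `num_examples` fields in the card metadata.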
## Filtering process
As mentioned, the HH-RLHF data in this dataset is filtered. Specifically, we take the first turn of the conversation, then remove any samples where the assistant:
- uses the word "human", "thank", or "sorry"
- asks a question
- uses a first person pronoun
This leaves samples which look like instruction-following, as opposed to conversation.
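The rules above could be approximated along the following lines. This is a minimal sketch and not the code used to build the dataset: the card does not specify the matching details, so the case-insensitive substring match, the use of `?` as the signal for a question, the pronoun list, and all function names here are assumptions. The transcript layout (`\n\nHuman:` and `\n\nAssistant:` markers) follows the public HH-RLHF format.

```python
import re

# Assumption: words are matched as case-insensitive substrings.
BANNED_WORDS = ("human", "thank", "sorry")

# Assumption: this pronoun list is illustrative; the card does not give one.
FIRST_PERSON = re.compile(
    r"\b(i|me|my|mine|myself|we|us|our|ours|ourselves)\b", re.IGNORECASE
)

TURN_MARKER = re.compile(r"\n\n(Human|Assistant):")


def first_turn(transcript):
    """Return the first (prompt, response) pair of a transcript, or None."""
    parts = TURN_MARKER.split(transcript)
    # With a capturing group, parts looks like:
    # ['', 'Human', prompt, 'Assistant', response, 'Human', ...]
    if len(parts) < 5 or parts[1] != "Human" or parts[3] != "Assistant":
        return None
    return parts[2].strip(), parts[4].strip()


def keeps_response(response):
    """Apply the three filtering rules to an assistant response."""
    lowered = response.lower()
    if any(word in lowered for word in BANNED_WORDS):
        return False  # uses "human", "thank", or "sorry"
    if "?" in response:
        return False  # asks a question (assumed: contains a question mark)
    if FIRST_PERSON.search(response):
        return False  # uses a first person pronoun
    return True


def filter_transcripts(transcripts):
    """Yield prompt/response records for transcripts that pass the filter."""
    for transcript in transcripts:
        turn = first_turn(transcript)
        if turn is not None and keeps_response(turn[1]):
            yield {"prompt": turn[0], "response": turn[1]}


examples = [
    "\n\nHuman: What is the capital of France?"
    "\n\nAssistant: The capital of France is Paris."
    "\n\nHuman: Thanks!\n\nAssistant: You're welcome.",
    "\n\nHuman: How do I bake bread?"
    "\n\nAssistant: What kind of bread do you want to bake?",
    "\n\nHuman: Tell me a joke.\n\nAssistant: I'm sorry, I don't know any.",
]
kept = list(filter_transcripts(examples))
print(len(kept))  # 1
```

Only the first transcript survives: the second response asks a question, and the third uses both "sorry" and a first person pronoun. A heuristic this simple also drops some legitimate answers (for example, "US" matches the pronoun "us"), which is the kind of trade-off a filter like this accepts.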
## License/Attribution
<!--
**Copyright (2023) MosaicML, Inc.**
-->
This dataset was developed at MosaicML (https://www.mosaicml.com) and its use is subject to the CC BY-SA 3.0 license.
Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:
Wikipedia (various pages) - https://www.wikipedia.org/
Copyright © Wikipedia editors and contributors.
Databricks (https://www.databricks.com)
Copyright © Databricks
When citing this dataset, please use the following:
```
@misc{mosaicml2023dolly_hhrlhf,
author = {MosaicML},
title = {Dolly-HHRLHF Dataset},
year = {2023},
publisher = {HuggingFace Datasets},
howpublished = {https://huggingface.co/datasets/mosaicml/dolly_hhrlhf},
}
```
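Provider: mosaicml

## Summary of original information

### Dataset overview

Basic information:
- Name: Dolly-HHRLHF
- Language: English (en)
- Task category: text generation (text-generation)
- License: CC BY-SA 3.0

Dataset structure:
- Features:
  - prompt: string
  - response: string

### Data splits

- Train set:
  - Number of samples: 59,310
  - Data size: 43,781,455.002688624 bytes
- Test set:
  - Number of samples: 5,129
  - Data size: 4,479,286.805304853 bytes

### Dataset size

- Download size: 24,882,010 bytes
- Total dataset size: 48,260,741.80799348 bytes

### Data sources and processing

- The dataset combines Databricks' dolly-15k with a filtered subset of Anthropic's HH-RLHF.
- The train set contains 14,814 samples from Dolly and 44,496 samples from HH-RLHF.
- The test set consists of 200 randomly selected Dolly samples and 4,929 HH-RLHF test set samples that passed the filter.

### Filtering rules

Applied to the assistant's response in the first turn of each HH-RLHF conversation:
- Remove samples where the assistant uses the word "human", "thank", or "sorry".
- Remove samples where the assistant asks a question.
- Remove samples where the assistant uses a first person pronoun.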



