sablo/oasst2_curated
Hugging Face · updated 2024-01-12 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sablo/oasst2_curated
Resource overview:
---
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 9014169
    num_examples: 4693
  - name: test
    num_bytes: 479119
    num_examples: 247
  download_size: 5127472
  dataset_size: 9493288
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
# Open Assistant 2 Top English Curated
## Dataset Details
### Dataset Description
A filtered and curated subset of the top-scoring conversations from https://huggingface.co/datasets/OpenAssistant/oasst2, saved in the Hugging Face chat format. The result is a high-quality dataset for supervised fine-tuning (SFT).
- **Created by:** [dctanner](https://huggingface.co/dctanner) and the team at [Sablo AI](https://sablo.ai)
- **License:** Apache 2.0
## Dataset Structure
Each example stores its conversation in a `messages` field, using the format that [Hugging Face Chat Templates](https://huggingface.co/docs/transformers/chat_templating) take as input:
```
[
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"}
]
```
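As a minimal sketch, the structure above can be checked in plain Python. The function name `is_valid_conversation` and the `ALLOWED_ROLES` set are illustrative assumptions rather than part of the dataset's tooling. The `system` role in particular appears nowhere in this card and is included only because chat templates commonly accept it.

```python
from typing import Any

# Assumption: roles permitted in a conversation. The card only shows
# "user" and "assistant"; "system" is added as a common chat-template role.
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_conversation(messages: Any) -> bool:
    """Return True if `messages` is a non-empty list of {"role", "content"} dicts."""
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        # Each message must carry exactly the two fields listed in the metadata.
        if set(message) != {"role", "content"}:
            return False
        role, content = message["role"], message["content"]
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            return False
        if not isinstance(content, str):
            return False
    return True


example = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
print(is_valid_conversation(example))  # True
```

With the `datasets` library installed, each row returned by `load_dataset("sablo/oasst2_curated", split="train")` should expose such a list under the `messages` key, going by the metadata above.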
## Dataset Creation
### Source Data
- **Source Dataset:** https://huggingface.co/datasets/OpenAssistant/oasst2
#### Data Collection and Processing
We started with the `top_k=1`, English-only conversations from https://huggingface.co/datasets/OpenAssistant/oasst2.
We then filtered and curated the data to remove conversations with:
- Duplicate or very similar responses
- Responses where the AI was actually responding like a person (present in the source dataset because the responses were written by humans pretending to be an AI, and not everyone followed these instructions closely)
- Profanity or responses inappropriate for an AI
- Very short responses (often below 50 or 200 characters)
- URLs
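The mechanical parts of this filtering can be sketched as follows. This is not the authors' actual pipeline, which the card does not publish. The thresholds, the URL pattern, the use of `difflib` for similarity, and the choice to inspect assistant turns only are all assumptions made for illustration. The profanity and "responding like a person" criteria are omitted because they need word lists or human judgment.

```python
import re
from difflib import SequenceMatcher

# Hypothetical thresholds: the card only says "often below 50 or 200 characters"
# and "very similar", so these values are placeholders.
MIN_RESPONSE_CHARS = 50
MAX_SIMILARITY = 0.9
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def assistant_responses(messages):
    """Collect the text of every assistant turn in a conversation."""
    return [m["content"] for m in messages if m["role"] == "assistant"]


def has_url(text):
    """Return True if the text contains something that looks like a URL."""
    return URL_PATTERN.search(text) is not None


def too_similar(a, b, threshold=MAX_SIMILARITY):
    """Character-level similarity via difflib; a stand-in for real deduplication."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def keep_conversation(messages):
    """Return True if no assistant turn trips the length, URL, or similarity checks."""
    responses = assistant_responses(messages)
    for text in responses:
        if len(text.strip()) < MIN_RESPONSE_CHARS or has_url(text):
            return False
    # Compare every pair of assistant turns within the same conversation.
    for i, first in enumerate(responses):
        for second in responses[i + 1:]:
            if too_similar(first, second):
                return False
    return True
```

A conversation list could then be reduced with `[c for c in conversations if keep_conversation(c)]`, where `conversations` is a hypothetical list of `messages` lists.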
# License
- **License:** Apache 2.0
This dataset is usable for commercial purposes.
# Contact
Created by [dctanner](https://huggingface.co/dctanner) and the team at [Sablo AI](https://sablo.ai)
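Provider:
sablo
Summary of original information
Open Assistant 2 Top English Curated dataset details
Dataset description
This dataset consists of high-scoring English conversations filtered and curated from the OpenAssistant/oasst2 dataset, saved in HF Chat format. It is suitable for supervised fine-tuning (SFT).
Dataset structure
The dataset uses the input format commonly used with Hugging Face Chat Templates:
[ {"role": "user", "content": "Hello, how are you?"}, {"role": "assistant", "content": "I'm doing great. How can I help you today?"} ]
Dataset creation
Source data
- Source dataset: OpenAssistant/oasst2
Data collection and processing
English conversations with top_k=1 were selected from OpenAssistant/oasst2, then filtered and curated as follows:
- Removed duplicate or very similar responses
- Removed conversations where the AI actually responded like a person (the responses were written by humans pretending to be an AI, and not everyone followed these instructions closely)
- Removed conversations containing profanity or inappropriate responses
- Removed very short responses (often below 50 or 200 characters)
- Removed responses containing URLs
License
- License: Apache 2.0
This dataset can be used for commercial purposes.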



