
sablo/oasst2_curated

Source: Hugging Face · Updated 2024-01-12 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sablo/oasst2_curated
Resource description:

---
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 9014169
    num_examples: 4693
  - name: test
    num_bytes: 479119
    num_examples: 247
  download_size: 5127472
  dataset_size: 9493288
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Open Assistant 2 Top English Curated

## Dataset Details

### Dataset Description

A filtered and curated dataset taken from the top-scoring https://huggingface.co/datasets/OpenAssistant/oasst2 conversations. Saved in HF Chat format. The result is a high-quality dataset for SFT.

- **Created by:** [dctanner](https://huggingface.co/dctanner) and the team at [Sablo AI](https://sablo.ai)
- **License:** Apache 2.0

## Dataset Structure

We structure the dataset using the format commonly used as input into [Hugging Face Chat Templates](https://huggingface.co/docs/transformers/chat_templating):

```
[
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"}
]
```

## Dataset Creation

### Source Data

- **Source Dataset:** https://huggingface.co/datasets/OpenAssistant/oasst2

#### Data Collection and Processing

We started with the top_k=1 English-only conversations from https://huggingface.co/datasets/OpenAssistant/oasst2.

Filtering and curation were done to remove conversations with:

- Duplicate or very similar responses
- Responses where the AI was actually responding like a person (present in this dataset as the responses are created by humans pretending to be an AI, and not everyone followed these instructions closely)
- Profanity or inappropriate responses for an AI
- Very short response lengths (often below 50 or 200 characters)
- URLs

# License

- **License:** Apache 2.0

This dataset is usable for commercial purposes.

# Contact

Created by [dctanner](https://huggingface.co/dctanner) and the team at [Sablo AI](https://sablo.ai)
Provider:
sablo
Summary of original information

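Open Assistant 2 Top English Curated: dataset details

Dataset description

This dataset consists of top-scoring English conversations filtered and curated from the OpenAssistant/oasst2 dataset, saved in HF Chat format. It is suitable for supervised fine-tuning (SFT).

Dataset structure

The dataset uses the format commonly used as input to Hugging Face Chat Templates: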

    [
      {"role": "user", "content": "Hello, how are you?"},
      {"role": "assistant", "content": "I'm doing great. How can I help you today?"}
    ]
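The message format above can be checked with a few lines of plain Python. The following is a minimal sketch, not part of the dataset's tooling: `validate_messages`, `render_plain`, and the `ALLOWED_ROLES` set are hypothetical names introduced here, and the assumption that only the `user` and `assistant` roles occur is taken from the example alone.

```python
from typing import Dict, List

Message = Dict[str, str]

# Assumption: only the two roles shown in the example appear in the data.
ALLOWED_ROLES = {"user", "assistant"}


def validate_messages(messages: List[Message]) -> bool:
    """Return True if the conversation is a non-empty list of
    {"role": str, "content": str} dicts with a known role."""
    if not messages:
        return False
    for message in messages:
        if set(message.keys()) != {"role", "content"}:
            return False
        if not isinstance(message["role"], str):
            return False
        if not isinstance(message["content"], str):
            return False
        if message["role"] not in ALLOWED_ROLES:
            return False
    return True


def render_plain(messages: List[Message]) -> str:
    """Hypothetical plain-text rendering for inspection only.
    Real chat templates are model-specific and differ from this."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


example = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

if __name__ == "__main__":
    print(validate_messages(example))
    print(render_plain(example))
```

For actual fine-tuning, a list in this shape would normally be passed to the chat template of the target model's tokenizer rather than to a hand-written renderer like the one above.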

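Dataset creation

Source data

Data collection and processing

The top_k=1 English conversations were selected from OpenAssistant/oasst2 and then filtered and curated as follows:

  • Removed duplicate or very similar responses
  • Removed conversations in which the AI actually responded like a person (the responses were written by humans pretending to be an AI, and not everyone followed those instructions closely)
  • Removed conversations containing profane or inappropriate responses
  • Removed very short responses (often below 50 or 200 characters)
  • Removed responses containing URLs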
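Some of the filters listed above can be approximated with simple heuristics. The sketch below is an illustration under stated assumptions, not the authors' actual pipeline, which the source does not publish. It covers only URL removal, a minimum response length, and exact duplicates after whitespace and case normalization. It does not attempt the "very similar", profanity, or human-like response checks. The function names, the URL pattern, and the 50-character default are all hypothetical choices.

```python
import re
from typing import Dict, List

Message = Dict[str, str]
Conversation = List[Message]

# Assumption: a simple pattern is enough to flag URLs for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

# Assumption: the source mentions thresholds of "50 or 200" characters
# without saying when each applies; 50 is used here as a default.
MIN_RESPONSE_CHARS = 50


def assistant_texts(conv: Conversation) -> List[str]:
    """Collect the content of every assistant turn."""
    return [m["content"] for m in conv if m["role"] == "assistant"]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def curate(conversations: List[Conversation],
           min_chars: int = MIN_RESPONSE_CHARS) -> List[Conversation]:
    """Keep conversations whose assistant turns are long enough,
    contain no URL, and have not been seen before."""
    seen = set()
    kept = []
    for conv in conversations:
        texts = assistant_texts(conv)
        if not texts:
            continue
        if any(URL_RE.search(t) for t in texts):
            continue
        if any(len(t.strip()) < min_chars for t in texts):
            continue
        key = tuple(normalize(t) for t in texts)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append(conv)
    return kept
```

Near-duplicate detection would need a similarity measure (for example, n-gram overlap or embeddings) instead of the exact-match key used here, and the remaining filters in the list likely require manual review or a classifier.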

License

  • License: Apache 2.0

This dataset can be used for commercial purposes.

Contact

Created by dctanner and the Sablo AI team.
