five

ewof/oasst-convo-unfiltered-deduped

收藏
Hugging Face2023-05-13 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/ewof/oasst-convo-unfiltered-deduped
下载链接
链接失效反馈
官方服务:
资源简介:
This dataset is the Open Assistant dataset https://huggingface.co/datasets/OpenAssistant/oasst1, and formatted to be easier to convert to whatever finetune data format you want, removes 75 instances of alignment and 81 dupes. oasst_clean_format_dedupe.py was first ran on 2023-04-12_oasst_all.trees.jsonl from OpenAssistant/oasst1 Inspired by https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered Credit to anon8231489123 for the cleanup script that I adapted to wizardlm_clean.py, I then took this script and adapted it to oasst_clean_data_format.py I converted to trees so each object in the output has a messages list, each message has a role and content. Each starting prompt in the oasst dataset will appear in the output n times where n is the number of replies the prompt had (unless it had a duplicate). so if the oasst tree looked like ``` user: aaa assistant: bbb user: ccc assistant: ddd assistant: eee user: fff assistant: ggg user: hhh assistant: iii ``` then the output would look like ``` { messages: [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "bbb" }, { "role": "user", "content": "ccc" }, { "role": "model", "content": "ddd" } ] } { messages: [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "eee" } // if tree ends with a user msg then it is not included in the output, this can be changed by passing --save-user-ends to the script ] } { messages: [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "ggg" }, { "role": "user", "content": "hhh" }, { "role": "model", "content": "iii" } ] } ```
提供机构:
ewof
原始信息汇总

数据集概述

数据集名称

  • OpenAssistant/oasst1

数据集来源

数据处理

  • 数据集经过oasst_clean_format_dedupe.py脚本处理,移除了75个对齐实例和81个重复项。
  • 处理日期:2023-04-12

数据格式

  • 数据转换为树状结构,每个输出对象包含一个messages列表。
  • 每个消息包含rolecontent两个属性。
  • 原始数据集中的每个起始提示将根据其收到的回复次数(排除重复)在输出中出现相应次数。

示例输出

json { "messages": [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "bbb" }, { "role": "user", "content": "ccc" }, { "role": "model", "content": "ddd" } ] } { "messages": [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "eee" } ] } { "messages": [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "ggg" }, { "role": "user", "content": "hhh" }, { "role": "model", "content": "iii" } ] }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作