ewof/oasst-convo-unfiltered-deduped
收藏Hugging Face2023-05-13 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/ewof/oasst-convo-unfiltered-deduped
下载链接
链接失效反馈官方服务:
资源简介:
This dataset is the Open Assistant dataset https://huggingface.co/datasets/OpenAssistant/oasst1, and formatted to be easier to convert to whatever finetune data format you want, removes 75 instances of alignment and 81 dupes.
oasst_clean_format_dedupe.py was first ran on 2023-04-12_oasst_all.trees.jsonl from OpenAssistant/oasst1
Inspired by https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered
Credit to anon8231489123 for the cleanup script that I adapted to wizardlm_clean.py, I then took this script and adapted it to oasst_clean_data_format.py
I converted to trees so each object in the output has a messages list, each message has a role and content. Each starting prompt in the oasst dataset will appear in the output n times where n is the number of replies the prompt had (unless it had a duplicate).
so if the oasst tree looked like
```
user: aaa
assistant: bbb
user: ccc
assistant: ddd
assistant: eee
user: fff
assistant: ggg
user: hhh
assistant: iii
```
then the output would look like
```
{
messages: [
{ "role": "user", "content": "aaa" },
{ "role": "model", "content": "bbb" },
{ "role": "user", "content": "ccc" },
{ "role": "model", "content": "ddd" }
]
}
{
messages: [
{ "role": "user", "content": "aaa" },
{ "role": "model", "content": "eee" }
// if tree ends with a user msg then it is not included in the output, this can be changed by passing --save-user-ends to the script
]
}
{
messages: [
{ "role": "user", "content": "aaa" },
{ "role": "model", "content": "ggg" },
{ "role": "user", "content": "hhh" },
{ "role": "model", "content": "iii" }
]
}
```
提供机构:
ewof
原始信息汇总
数据集概述
数据集名称
- OpenAssistant/oasst1
数据集来源
- 原始数据集链接:OpenAssistant/oasst1
数据处理
- 数据集经过
oasst_clean_format_dedupe.py脚本处理,移除了75个对齐实例和81个重复项。 - 处理日期:2023-04-12
数据格式
- 数据转换为树状结构,每个输出对象包含一个
messages列表。 - 每个消息包含
role和content两个属性。 - 原始数据集中的每个起始提示将根据其收到的回复次数(排除重复)在输出中出现相应次数。
示例输出
json { "messages": [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "bbb" }, { "role": "user", "content": "ccc" }, { "role": "model", "content": "ddd" } ] } { "messages": [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "eee" } ] } { "messages": [ { "role": "user", "content": "aaa" }, { "role": "model", "content": "ggg" }, { "role": "user", "content": "hhh" }, { "role": "model", "content": "iii" } ] }



