Trelis/openassistant-guanaco-EOS
Source: Hugging Face · Updated 2023-10-04 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Trelis/openassistant-guanaco-EOS
Resource description:
---
license: apache-2.0
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
tags:
- human-feedback
- llama-2
size_categories:
- 1K<n<10k
pretty_name: Filtered OpenAssistant Conversations
---
# Chat Fine-tuning Dataset - Guanaco Style
This dataset allows for fine-tuning chat models using "### Human:" and "### Assistant:" as the beginning and end of sequence tokens.
Preparation:
1. The dataset is cloned from [TimDettmers](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), which itself is a subset of the Open Assistant dataset, which you can find [here](https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main). This subset of the data only contains the highest-rated paths in the conversation tree, with a total of 9,846 samples.
1. The dataset was then slightly adjusted: if a row of data ends with an assistant response, "### Human" is appended to the end of that row.
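The adjustment described above can be sketched as follows. This is a minimal illustration, not the script actually used to build the dataset: the function name `append_human_marker` is hypothetical, and the sketch assumes that a row "ends with an assistant response" when its last role marker is "### Assistant" and that the marker is appended with no separator.

```python
HUMAN = "### Human"
ASSISTANT = "### Assistant"


def append_human_marker(text: str) -> str:
    """Append "### Human" when the last turn in `text` is an assistant turn.

    Assumption: the last role marker found in the row decides who spoke last.
    """
    last_human = text.rfind(HUMAN)
    last_assistant = text.rfind(ASSISTANT)
    # Only rows whose final marker is "### Assistant" are changed.
    if last_assistant > last_human:
        return text + HUMAN
    return text


# Toy rows (made up for illustration, not taken from the dataset).
ends_with_assistant = "### Human: Hi### Assistant: Hello!"
ends_with_human = "### Human: Hi### Assistant: Hello!### Human: Thanks"

adjusted = append_human_marker(ends_with_assistant)
unchanged = append_human_marker(ends_with_human)
```

Under these assumptions the function is idempotent: once "### Human" has been appended, the last marker is no longer an assistant marker, so a second call leaves the row unchanged.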
Details of the root dataset follow, copied from that repo:
# OpenAssistant Conversations Dataset (OASST1)
## Dataset Description
- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** https://arxiv.org/abs/2304.07327
### Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292
quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus
is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Please refer to our [paper](https://arxiv.org/abs/2304.07327) for further details.
### Dataset Structure
This dataset contains message trees. Each message tree has an initial prompt message as the root node,
which can have multiple child messages as replies, and these child messages can have multiple replies.
All messages have a role property: this can either be "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".
This version of the dataset contains data collected on the [open-assistant.io](https://open-assistant.io/) website until April 12, 2023.
### JSON Example: Message
For readability, the following JSON examples are shown formatted with indentation on multiple lines.
Objects are stored without indentation (on single lines) in the actual jsonl files.
```json
{
"message_id": "218440fd-5317-4355-91dc-d001416df62b",
"parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
"user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
"text": "It was the winter of 2035, and artificial intelligence (..)",
"role": "assistant",
"lang": "en",
"review_count": 3,
"review_result": true,
"deleted": false,
"rank": 0,
"synthetic": true,
"model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
"labels": {
"spam": { "value": 0.0, "count": 3 },
"lang_mismatch": { "value": 0.0, "count": 3 },
"pii": { "value": 0.0, "count": 3 },
"not_appropriate": { "value": 0.0, "count": 3 },
"hate_speech": { "value": 0.0, "count": 3 },
"sexual_content": { "value": 0.0, "count": 3 },
"quality": { "value": 0.416, "count": 3 },
"toxicity": { "value": 0.16, "count": 3 },
"humor": { "value": 0.0, "count": 3 },
"creativity": { "value": 0.33, "count": 3 },
"violence": { "value": 0.16, "count": 3 }
}
}
```
### JSON Example: Conversation Tree
For readability, only a subset of the message properties is shown here.
```json
{
"message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
"tree_state": "ready_for_export",
"prompt": {
"message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
"text": "Why can't we divide by 0? (..)",
"role": "prompter",
"lang": "en",
"replies": [
{
"message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
"text": "The reason we cannot divide by zero is because (..)",
"role": "assistant",
"lang": "en",
"replies": [
// ...
]
},
{
"message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
"text": "The reason that the result of a division by zero is (..)",
"role": "assistant",
"lang": "en",
"replies": [
{
"message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
"text": "Math is confusing. Like those weird Irrational (..)",
"role": "prompter",
"lang": "en",
"replies": [
{
"message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
"text": "Irrational numbers are simply numbers (..)",
"role": "assistant",
"lang": "en",
"replies": []
},
// ...
]
}
]
}
]
}
}
```
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and Python code to read and write jsonl files containing oasst data objects.
If you would like to explore the dataset yourself you can find a
[`getting-started`](https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/openassistant-oasst1/getting-started.ipynb)
notebook in the `notebooks/openassistant-oasst1` folder of the [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
GitHub repository.
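As a rough illustration of the nested structure in the conversation-tree example, the sketch below walks a tree and yields every root-to-leaf thread, then checks that roles alternate along each thread. The helper names `iter_threads` and `roles_alternate` and the shortened message IDs are hypothetical; only the `message_id`, `role`, and `replies` keys come from the example above. The `oasst-data` package remains the reference for reading real files.

```python
from typing import Iterator


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[list]:
    """Yield every root-to-leaf path of messages starting at `node`."""
    path = prefix + (node,)
    replies = node.get("replies", [])
    if not replies:
        # Leaf message: the accumulated path is one complete thread.
        yield list(path)
        return
    for reply in replies:
        yield from iter_threads(reply, path)


def roles_alternate(thread: list) -> bool:
    """Return True if no two consecutive messages share the same role."""
    roles = [message["role"] for message in thread]
    return all(a != b for a, b in zip(roles, roles[1:]))


# Toy tree with made-up IDs, shaped like the conversation-tree example.
tree = {
    "message_tree_id": "p1",
    "prompt": {
        "message_id": "p1", "role": "prompter", "replies": [
            {"message_id": "a1", "role": "assistant", "replies": []},
            {"message_id": "a2", "role": "assistant", "replies": [
                {"message_id": "p2", "role": "prompter", "replies": [
                    {"message_id": "a3", "role": "assistant", "replies": []},
                ]},
            ]},
        ],
    },
}

threads = list(iter_threads(tree["prompt"]))
thread_ids = [[m["message_id"] for m in t] for t in threads]
```

Threads are produced in depth-first order, so the first reply's subtree is fully enumerated before the second reply is visited.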
## Main Dataset Files
Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
or as a flat list (table) of messages (extension `.messages.jsonl.gz`).
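Either file type can be read with the Python standard library alone, since each line holds one JSON object. The following is a minimal sketch; the function name `read_jsonl_gz` is hypothetical, and the file path in the commented usage line assumes the file has already been downloaded locally.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped jsonl file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


# Usage (assumes a local copy of the file):
# for tree in read_jsonl_gz("2023-04-12_oasst_ready.trees.jsonl.gz"):
#     print(tree["message_tree_id"], tree["tree_state"])
```

Because the function is a generator, large exports are streamed line by line instead of being loaded into memory at once.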
### Ready For Export Trees
```
2023-04-12_oasst_ready.trees.jsonl.gz 10,364 trees with 88,838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
```
Trees in `ready_for_export` state, excluding spam and deleted messages and including message labels.
The oasst_ready-trees file is usually sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
### All Trees
```
2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
```
All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
### Supplemental Exports: Spam & Prompts
```
2023-04-12_oasst_spam.messages.jsonl.gz
```
These are messages which were deleted or have a negative review result (`"review_result": false`).
Besides low quality, a frequent reason for message deletion is a wrong language tag.
```
2023-04-12_oasst_prompts.messages.jsonl.gz
```
These are all the kept initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.
### Using the Huggingface Datasets
While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet as train/validation splits.
These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).
To load the oasst1 train & validation splits use:
```python
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst1")
train = ds['train'] # len(train)=84437 (95%)
val = ds['validation'] # len(val)=4401 (5%)
```
The messages appear in depth-first order of the message trees.
Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
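The reconstruction described above can be sketched as follows. This is an illustrative helper rather than the project's own code: the name `build_trees` is hypothetical, and the sketch assumes that root prompts carry a null (`None`) `parent_id` and that every non-root message's parent is present in the input list.

```python
from collections import defaultdict


def build_trees(messages: list) -> list:
    """Rebuild nested trees from flat messages via parent_id / message_id."""
    # Group messages by parent; roots are assumed to have parent_id None.
    children = defaultdict(list)
    for message in messages:
        children[message.get("parent_id")].append(message)

    def attach(message: dict) -> dict:
        # Copy the message so the flat input is left untouched.
        node = dict(message)
        node["replies"] = [attach(c) for c in children.get(message["message_id"], [])]
        return node

    return [attach(root) for root in children.get(None, [])]


# Toy flat table with made-up IDs.
flat = [
    {"message_id": "p1", "parent_id": None, "role": "prompter"},
    {"message_id": "a1", "parent_id": "p1", "role": "assistant"},
    {"message_id": "a2", "parent_id": "p1", "role": "assistant"},
    {"message_id": "p2", "parent_id": "a2", "role": "prompter"},
]

trees = build_trees(flat)
```

Replies keep the order in which they appear in the flat table, so the depth-first ordering of the source files carries over to each `replies` list.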
### Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
**Languages with over 1000 messages**
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
<details>
<summary><b>Languages with under 1000 messages</b></summary>
<ul>
<li>Vietnamese: 952</li>
<li>Basque: 947</li>
<li>Polish: 886</li>
<li>Hungarian: 811</li>
<li>Arabic: 666</li>
<li>Dutch: 628</li>
<li>Swedish: 512</li>
<li>Turkish: 454</li>
<li>Finnish: 386</li>
<li>Czech: 372</li>
<li>Danish: 358</li>
<li>Galician: 339</li>
<li>Hebrew: 255</li>
<li>Romanian: 200</li>
<li>Norwegian Bokmål: 133</li>
<li>Indonesian: 115</li>
<li>Bulgarian: 95</li>
<li>Bengali: 82</li>
<li>Persian: 72</li>
<li>Greek: 66</li>
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>
## Contact
- Discord: [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistant@laion.ai](mailto:open-assistant@laion.ai)
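Provider:
Trelis
Summary of original information
Dataset overview
Basic information
- License: Apache 2.0
- Supported languages: 35 languages, including English, Spanish, Russian, German, and Chinese.
- Tags: human-feedback, llama-2
- Size: 1K<n<10k
- Name: Filtered OpenAssistant Conversations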
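Dataset description
- Source: the dataset is cloned from TimDettmers and is a subset of the Open Assistant dataset.
- Filtering: it contains only the highest-rated paths in the conversation tree, 9,846 samples in total.
- Adjustment: if a row of data ends with an assistant response, "### Human" is appended to the end.
Data structure
- Message trees: each message tree has an initial prompt message as its root node, which can have multiple child messages as replies, and these child messages can in turn have multiple replies.
- Roles: each message's role property is either "assistant" or "prompter", and roles strictly alternate within a conversation thread.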
Data files
- Formats: nested message trees (`.trees.jsonl.gz`) and flat message lists (`.messages.jsonl.gz`).
- Main files:
  - `2023-04-12_oasst_ready.trees.jsonl.gz`: 10,364 trees with 88,838 messages.
  - `2023-04-12_oasst_ready.messages.jsonl.gz`: 88,838 messages.
  - `2023-04-12_oasst_all.trees.jsonl.gz`: 66,497 trees with 161,443 messages.
  - `2023-04-12_oasst_all.messages.jsonl.gz`: 161,443 messages.
Usage
- Hugging Face Datasets: the train and validation splits can be loaded directly with `load_dataset("OpenAssistant/oasst1")`; `ds['train']` holds 84,437 messages (95%) and `ds['validation']` holds 4,401 messages (5%).
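Language distribution
- Main languages: English, Spanish, Russian, German, Chinese, and others.
- Other languages: Vietnamese, Basque, Polish, and others.
Collected summary
Dataset introduction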

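Construction method
In research on dialogue generation models, data quality has a decisive influence on model performance. The Trelis/openassistant-guanaco-EOS dataset derives from OpenAssistant Conversations (OASST1), a large-scale multilingual conversation corpus built through the crowd-sourcing effort of more than 13,500 volunteers worldwide and containing 161,443 messages and 461,292 quality ratings. This dataset keeps only the highest-rated paths in the conversation trees, 9,846 samples in total, and applies one adaptation: for rows that end with an assistant reply, a "### Human" marker is appended, to suit fine-tuning chat models with specific sequence markers.
Usage
The dataset is designed for supervised fine-tuning of chat models. Users can load it directly with the Hugging Face Datasets library and use "### Human:" and "### Assistant:" as the beginning and end of sequence markers to format conversation samples, fitting instruction fine-tuning pipelines for large language models such as LLaMA-2. The underlying OASST1 data is provided as a flat list of messages, each with key fields such as message_id, parent_id, text, and role, so that complete conversation trees can be rebuilt from the parent-child relationships. Researchers can filter the data by attributes such as language and quality rating, or use its multi-turn structure for experiments in in-context learning, reward model training, and other natural language processing tasks.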
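Background and challenges
Background overview
As alignment research for large language models grows in importance, the Trelis/openassistant-guanaco-EOS dataset traces its roots to the OpenAssistant Conversations (OASST1) project released in 2023. Led by LAION-AI and other organizations, the project brought together the crowd-sourcing effort of more than 13,500 volunteers worldwide to build a high-quality, multilingual corpus of human and assistant-style conversations, with the aim of democratizing research on large-scale alignment. Its core research question is how to use human feedback to improve the training of chat models and thereby raise their helpfulness, honesty, and harmlessness across diverse tasks. Its broad language coverage and fine-grained quality annotations lay a foundation for subsequent supervised fine-tuning and reward model training, and have advanced the development of open-domain dialogue systems.
Current challenges
The dataset addresses a core challenge in chat model alignment: ensuring that model responses both conform to human values and remain diverse and creative. Specifically, the challenges include the scarcity of high-quality conversation data, handling cultural sensitivity in multilingual contexts, and keeping subjective evaluation criteria consistent in crowd-sourced annotation. During construction, the team faced the complexity of large-scale crowd-sourced collaboration, including managing conversation tree structures, balancing data quality across languages, and annotating and denoising a large volume of messages. In addition, selecting the highest-rated conversation paths from the original OASST1 data and adapting them to a specific fine-tuning format required overcoming technical difficulties in data consistency and structure conversion.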
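Common scenarios
Classic use cases
In natural language processing, fine-tuning is a key step in improving a chat model's interactive ability. With its carefully filtered, high-quality conversation paths, the Trelis/openassistant-guanaco-EOS dataset gives researchers standardized training material. It uses "### Human:" and "### Assistant:" as sequence markers to build a clear conversation structure, so that a model can learn natural, coherent exchanges between a human and an assistant. This design is particularly suited to supervised fine-tuning, helping a model generate responses that meet human expectations across diverse language settings.
Academic problems addressed
The dataset addresses a core challenge in large language model alignment research: keeping model output consistent with human values and intent. By providing a multilingual, human-generated, human-annotated conversation corpus, it eases the shortage of high-quality alignment data. The quality ratings and rich labels give the research community a basis for exploring reinforcement learning from human feedback, reward model construction, and conversation safety evaluation, and they support the democratization and reproducibility of alignment research.
Practical applications
In deployment, models fine-tuned on this dataset can be applied to intelligent customer service, virtual assistants, educational tutoring, and multilingual content generation. Its coverage of 35 languages is particularly useful for building dialogue systems capable of cross-cultural communication. Companies can use such models to improve user service and automatically handle inquiries and support tasks; educational institutions can use them to build interactive learning tools that give students personalized question answering and tutoring.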
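Recent research on the dataset
Latest research directions
In natural language processing, the optimization and alignment of multilingual chat models is becoming a frontier topic. As a high-quality subset of OpenAssistant Conversations, the Trelis/openassistant-guanaco-EOS dataset is used in recent research that applies reinforcement learning from human feedback to improve the conversation generation ability and safety of models in multilingual settings. The rise of open-source large models such as Llama-2 has driven broad use of the dataset in instruction fine-tuning and alignment strategies, aimed at challenges such as bias control, cross-lingual generalization, and ethical compliance. This research supports an open, collaborative science ecosystem and provides a data foundation for building more reliable and inclusive AI assistants.
The content above was collected and summarized by 遇见数据集.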



