oasst1
Source: ModelScope community (魔搭社区). Updated 2026-04-28; indexed 2024-05-15.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/oasst1
# OpenAssistant Conversations Dataset (OASST1)
## Dataset Description
- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** https://arxiv.org/abs/2304.07327
### Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292
quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus
is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Please refer to our [paper](https://arxiv.org/abs/2304.07327) for further details.
### Dataset Structure
This dataset contains message trees. Each message tree has an initial prompt message as the root node,
which can have multiple child messages as replies, and these child messages can have multiple replies.
All messages have a `role` property, which is either "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".
This version of the dataset contains data collected on the [open-assistant.io](https://open-assistant.io/) website until April 12, 2023.
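The alternation rule can be checked mechanically. The following is a minimal sketch, assuming each message is a dict with a `role` key and a `replies` list of child messages (the nested layout used in the conversation tree example in this card). The function name `roles_alternate` and the sample trees are hypothetical and not part of the OpenAssistant tooling.

```python
from typing import Any


def roles_alternate(node: dict[str, Any], expected: str = "prompter") -> bool:
    """Return True if every path from this node down to a leaf alternates roles."""
    if node.get("role") != expected:
        return False
    next_role = "assistant" if expected == "prompter" else "prompter"
    # Every reply must carry the opposite role, checked recursively.
    return all(roles_alternate(child, next_role) for child in node.get("replies", []))


# Tiny hand-made trees (placeholders, not real dataset content).
valid_tree = {
    "role": "prompter",
    "replies": [
        {"role": "assistant", "replies": [{"role": "prompter", "replies": []}]},
        {"role": "assistant", "replies": []},
    ],
}
invalid_tree = {"role": "prompter", "replies": [{"role": "prompter", "replies": []}]}

print(roles_alternate(valid_tree))
print(roles_alternate(invalid_tree))
```

The check starts from the root, so it expects the first message to be a "prompter" message.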
### JSON Example: Message
For readability, the following JSON examples are shown formatted with indentation on multiple lines.
Objects are stored without indentation (on single lines) in the actual jsonl files.
```json
{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
```
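As an illustration of working with the `labels` object, the sketch below parses one single-line message and looks up a label value. It assumes the layout shown in the example above, where each label maps to an object with `value` and `count`. The helper `label_value` is a hypothetical name, and the sample line is trimmed to a few fields.

```python
import json
from typing import Optional

# One message as it might appear on a single line of a jsonl file
# (trimmed to a few of the fields from the example above).
line = (
    '{"message_id": "218440fd-5317-4355-91dc-d001416df62b", "role": "assistant", '
    '"lang": "en", "labels": {"quality": {"value": 0.416, "count": 3}, '
    '"toxicity": {"value": 0.16, "count": 3}}}'
)
message = json.loads(line)


def label_value(msg: dict, name: str, default: Optional[float] = None) -> Optional[float]:
    """Return the `value` of one label, or `default` if the label is absent."""
    label = (msg.get("labels") or {}).get(name)
    return default if label is None else label["value"]


print(label_value(message, "quality"))
print(label_value(message, "humor", default=0.0))
```

Returning a default for missing labels avoids a `KeyError` when a message does not carry every label.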
### JSON Example: Conversation Tree
For readability, only a subset of the message properties is shown here.
```json
{
    "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "tree_state": "ready_for_export",
    "prompt": {
        "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
        "text": "Why can't we divide by 0? (..)",
        "role": "prompter",
        "lang": "en",
        "replies": [
            {
                "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
                "text": "The reason we cannot divide by zero is because (..)",
                "role": "assistant",
                "lang": "en",
                "replies": [
                    // ...
                ]
            },
            {
                "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
                "text": "The reason that the result of a division by zero is (..)",
                "role": "assistant",
                "lang": "en",
                "replies": [
                    {
                        "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
                        "text": "Math is confusing. Like those weird Irrational (..)",
                        "role": "prompter",
                        "lang": "en",
                        "replies": [
                            {
                                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                                "text": "Irrational numbers are simply numbers (..)",
                                "role": "assistant",
                                "lang": "en",
                                "replies": []
                            },
                            // ...
                        ]
                    }
                ]
            }
        ]
    }
}
```
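A common use of the nested form is to flatten a tree into individual conversation threads, one per root-to-leaf path. The following is a rough sketch of that traversal, assuming the `prompt` / `replies` nesting shown above. `iter_threads` is a hypothetical helper, and the sample tree uses placeholder ids and texts.

```python
from typing import Any, Iterator


def iter_threads(node: dict[str, Any], prefix: tuple = ()) -> Iterator[list[dict[str, Any]]]:
    """Yield every path from `node` down to a leaf as a list of messages."""
    path = prefix + (node,)
    replies = node.get("replies", [])
    if not replies:
        # Leaf reached: the accumulated path is one complete thread.
        yield list(path)
        return
    for child in replies:
        yield from iter_threads(child, path)


# Hand-made tree in the same shape as the example above.
tree = {
    "message_tree_id": "t1",
    "tree_state": "ready_for_export",
    "prompt": {
        "message_id": "t1", "role": "prompter", "text": "question",
        "replies": [
            {"message_id": "a1", "role": "assistant", "text": "answer 1", "replies": []},
            {"message_id": "a2", "role": "assistant", "text": "answer 2", "replies": [
                {"message_id": "p2", "role": "prompter", "text": "follow-up", "replies": []},
            ]},
        ],
    },
}

threads = list(iter_threads(tree["prompt"]))
for thread in threads:
    print(" -> ".join(m["message_id"] for m in thread))
```

Each yielded thread shares its leading messages with sibling threads, so deduplicate if the prefix should only be counted once.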
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and Python code to read and write jsonl files containing oasst data objects.
If you would like to explore the dataset yourself, you can find a
[`getting-started`](https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/openassistant-oasst1/getting-started.ipynb)
notebook in the `notebooks/openassistant-oasst1` folder of the [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
GitHub repository.
## Main Dataset Files
Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
or as a flat list (table) of messages (extension `.messages.jsonl.gz`).
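Both file types are gzip-compressed jsonl, so they can be read with the Python standard library alone. Below is a minimal sketch, assuming one JSON object per line as described earlier. `read_jsonl_gz` is a hypothetical helper (the `oasst-data` package provides its own readers), and the demo writes a tiny stand-in file rather than using a real export.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_jsonl_gz(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed jsonl file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Demonstration on a tiny stand-in file so the sketch runs without the real export.
sample = [
    {"message_id": "m1", "parent_id": None, "role": "prompter"},
    {"message_id": "m2", "parent_id": "m1", "role": "assistant"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.messages.jsonl.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        for obj in sample:
            f.write(json.dumps(obj) + "\n")
    loaded = list(read_jsonl_gz(demo_path))

print(len(loaded))
```

Reading lazily with a generator keeps memory use low for the larger exports.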
### Ready For Export Trees
```
2023-04-12_oasst_ready.trees.jsonl.gz      10,364 trees with 88,838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz   88,838 messages
```
Trees in `ready_for_export` state, with spam and deleted messages removed and message labels included.
The `oasst_ready` trees file is usually sufficient for supervised fine-tuning (SFT) and reward model (RM) training.
### All Trees
```
2023-04-12_oasst_all.trees.jsonl.gz      66,497 trees with 161,443 total messages
2023-04-12_oasst_all.messages.jsonl.gz   161,443 messages
```
All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator` (trees halted by a moderator).
### Supplemental Exports: Spam & Prompts
```
2023-04-12_oasst_spam.messages.jsonl.gz
```
These are messages which were deleted or have a negative review result (`"review_result": false`).
Besides low quality, a frequent reason for message deletion is a wrong language tag.
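The selection rule described above can be written as a small predicate. This is only a sketch under the assumption that the `deleted` and `review_result` fields behave as in the message example. `is_rejected` is a hypothetical name, and the actual export logic used by the project may differ.

```python
def is_rejected(message: dict) -> bool:
    """True if the message was deleted or explicitly failed review."""
    # `is False` is deliberate: a missing or null review_result is not treated as negative.
    return bool(message.get("deleted")) or message.get("review_result") is False


# Placeholder messages covering the four cases.
messages = [
    {"message_id": "a", "deleted": False, "review_result": True},
    {"message_id": "b", "deleted": True, "review_result": True},
    {"message_id": "c", "deleted": False, "review_result": False},
    {"message_id": "d", "deleted": False},
]
rejected_ids = [m["message_id"] for m in messages if is_rejected(m)]
print(rejected_ids)
```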
```
2023-04-12_oasst_prompts.messages.jsonl.gz
```
These are all retained initial prompt messages with a positive review result (no spam) from trees in `ready_for_export` or `prompt_lottery_waiting` state.
### Using the Huggingface Datasets
While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet as train/validation splits.
These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).
To load the oasst1 train & validation splits use:
```python
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst1")
train = ds['train'] # len(train)=84437 (95%)
val = ds['validation'] # len(val)=4401 (5%)
```
The messages appear in depth-first order of the message trees.
Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
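The reconstruction described above can be sketched as follows. The code assumes that root messages have a null `parent_id` and that every message carries `message_tree_id`, as in the flat messages files. `build_trees` is a hypothetical helper and the sample rows are placeholders.

```python
from typing import Any


def build_trees(messages: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Rebuild nested trees from flat messages; returns {message_tree_id: root node}."""
    # Copy each message and give it an empty `replies` list to fill in.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots: dict[str, dict[str, Any]] = {}
    for m in messages:
        node = nodes[m["message_id"]]
        parent_id = m.get("parent_id")
        if parent_id is None:
            # Assumption: a message without a parent is the initial prompt of its tree.
            roots[m["message_tree_id"]] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is missing from the input are left unattached.
    return roots


flat = [
    {"message_id": "t1", "parent_id": None, "message_tree_id": "t1", "role": "prompter"},
    {"message_id": "a1", "parent_id": "t1", "message_tree_id": "t1", "role": "assistant"},
    {"message_id": "a2", "parent_id": "t1", "message_tree_id": "t1", "role": "assistant"},
    {"message_id": "p1", "parent_id": "a2", "message_tree_id": "t1", "role": "prompter"},
]
trees = build_trees(flat)
root = trees["t1"]
print([reply["message_id"] for reply in root["replies"]])
```

Because all nodes are indexed before linking, the function does not depend on the input being in depth-first order.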
### Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
**Languages with over 1000 messages**
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
<details>
<summary><b>Languages with under 1000 messages</b></summary>
<ul>
<li>Vietnamese: 952</li>
<li>Basque: 947</li>
<li>Polish: 886</li>
<li>Hungarian: 811</li>
<li>Arabic: 666</li>
<li>Dutch: 628</li>
<li>Swedish: 512</li>
<li>Turkish: 454</li>
<li>Finnish: 386</li>
<li>Czech: 372</li>
<li>Danish: 358</li>
<li>Galician: 339</li>
<li>Hebrew: 255</li>
<li>Romanian: 200</li>
<li>Norwegian Bokmål: 133</li>
<li>Indonesian: 115</li>
<li>Bulgarian: 95</li>
<li>Bengali: 82</li>
<li>Persian: 72</li>
<li>Greek: 66</li>
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>
## Contact
- Discord: [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistant@laion.ai](mailto:open-assistant@laion.ai)
Provider: maas
Created: 2024-06-06



