recastai/openassistant-guanaco-chatml
收藏Hugging Face2024-04-06 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/recastai/openassistant-guanaco-chatml
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: text
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: language
dtype: string
splits:
- name: train
num_bytes: 31236425.236542758
num_examples: 9829
download_size: 18142328
dataset_size: 31236425.236542758
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
task_categories:
- question-answering
- text2text-generation
---
# Dataset Card for "openassistant-guanaco-chatml "
## Dataset Summary
This dataset has been created by **Re:cast AI** to transform the existing dataset [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) into a [chatml](https://huggingface.co/docs/transformers/main/en/chat_templating) friendly format for use in SFT tasks with pretrained models.
The following changes have been made:
1. All conversations end in the assistant response.
2. Each example has a corresponding 'language' category that corresponds to the language use in the example.
## Dataset Structure
```python
Dataset({
features: ['text', 'messages', 'language'],
num_rows: 9829
})
messages[
{'content': 'Can you write a short introduction about the relevance of... etc.', 'role': 'user'},
{'content': '"Monopsony" refers to a market structure where there is... etc.','role': 'assistant'}
]
```
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("recastai/openassistant-guanaco-chatml", split="train")
```
## Modification
Example of applying a custom system message of your choice for chatml training.
```python
INSTRUCTIONS = (
"You are an expert AI assistant that helps users answer questions over a variety of topics. Some rules you always follow\n"
"1. INSERT YOUR RULES HERE"
)
def apply_system_message(example):
example['messages'].insert(0, {'content': INSTRUCTIONS, 'role': 'system'})
return example
dataset = dataset.map(apply_system_message)
```
提供机构:
recastai
原始信息汇总
数据集概述
数据集名称
- openassistant-guanaco-chatml
数据集创建者
- Re:cast AI
数据集目的
- 将现有的数据集openassistant-guanaco转换为chatml友好格式,用于SFT任务中预训练模型的使用。
数据集结构
- 特征:
- text: 字符串类型
- messages: 列表类型,包含
- content: 字符串类型
- role: 字符串类型
- language: 字符串类型
- 数据集大小:31236425.236542758字节
- 下载大小:18142328字节
- 训练集:
- 示例数:9829
- 数据量:31236425.236542758字节
数据集修改
- 所有对话以助手响应结束。
- 每个示例都有一个对应的language类别,表示示例中使用的语言。
数据集使用
- 通过
load_dataset函数加载数据集,指定分割为"train"。
数据集定制
- 提供了一个示例函数
apply_system_message,用于在对话开始前插入自定义的系统消息。



