five

recastai/openassistant-guanaco-chatml

收藏
Hugging Face2024-04-06 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/recastai/openassistant-guanaco-chatml
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: text dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: language dtype: string splits: - name: train num_bytes: 31236425.236542758 num_examples: 9829 download_size: 18142328 dataset_size: 31236425.236542758 configs: - config_name: default data_files: - split: train path: data/train-* task_categories: - question-answering - text2text-generation --- # Dataset Card for "openassistant-guanaco-chatml " ## Dataset Summary This dataset has been created by **Re:cast AI** to transform the existing dataset [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) into a [chatml](https://huggingface.co/docs/transformers/main/en/chat_templating) friendly format for use in SFT tasks with pretrained models. The following changes have been made: 1. All conversations end in the assistant response. 2. Each example has a corresponding 'language' category that corresponds to the language use in the example. ## Dataset Structure ```python Dataset({ features: ['text', 'messages', 'language'], num_rows: 9829 }) messages[ {'content': 'Can you write a short introduction about the relevance of... etc.', 'role': 'user'}, {'content': '"Monopsony" refers to a market structure where there is... etc.','role': 'assistant'} ] ``` ## Usage ```python from datasets import load_dataset dataset = load_dataset("recastai/openassistant-guanaco-chatml", split="train") ``` ## Modification Example of applying a custom system message of your choice for chatml training. ```python INSTRUCTIONS = ( "You are an expert AI assistant that helps users answer questions over a variety of topics. Some rules you always follow\n" "1. INSERT YOUR RULES HERE" ) def apply_system_message(example): example['messages'].insert(0, {'content': INSTRUCTIONS, 'role': 'system'}) return example dataset = dataset.map(apply_system_message) ```
提供机构:
recastai
原始信息汇总

数据集概述

数据集名称

  • openassistant-guanaco-chatml

数据集创建者

  • Re:cast AI

数据集目的

数据集结构

  • 特征:
    • text: 字符串类型
    • messages: 列表类型,包含
      • content: 字符串类型
      • role: 字符串类型
    • language: 字符串类型
  • 数据集大小:31236425.236542758字节
  • 下载大小:18142328字节
  • 训练集:
    • 示例数:9829
    • 数据量:31236425.236542758字节

数据集修改

  • 所有对话以助手响应结束。
  • 每个示例都有一个对应的language类别,表示示例中使用的语言。

数据集使用

  • 通过load_dataset函数加载数据集,指定分割为"train"。

数据集定制

  • 提供了一个示例函数apply_system_message,用于在对话开始前插入自定义的系统消息。
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作