Felladrin/ChatML-webGPT_x_dolly

Name: Felladrin/ChatML-webGPT_x_dolly
Creator: Felladrin
Published: 2024-02-24 10:39:52
License: CC BY-SA 3.0

Source: Hugging Face. Updated 2024-02-24. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Felladrin/ChatML-webGPT_x_dolly

Description:

---
language:
- en
license: cc-by-sa-3.0
size_categories:
- 10K<n<100K
task_categories:
- question-answering
---

[starfishmedical/webGPT_x_dolly](https://huggingface.co/datasets/starfishmedical/webGPT_x_dolly) in ChatML format, ready to use in [HuggingFace TRL's SFT Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer).

Python code used for conversion:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
import random

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

dataset = load_dataset("starfishmedical/webGPT_x_dolly", split="train")

def format(columns):
    instruction = columns["instruction"].strip()
    input = columns["input"].strip()
    assistant_message = columns["output"].strip()

    if random.random() < 0.5:
        user_message = f"Question:\n{instruction}\n\nContext:\n{input}"
    else:
        user_message = f"Context:\n{input}\n\nQuestion:\n{instruction}"

    messages = [
        {
            "role": "user",
            "content": user_message,
        },
        {
            "role": "assistant",
            "content": assistant_message,
        },
    ]

    return { "text": tokenizer.apply_chat_template(messages, tokenize=False) }

dataset.map(format).select_columns(['text']).to_parquet("train.parquet")
```

Provider:

Felladrin

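Summary of original information

Dataset overview

Basic information

Language: English
License: CC BY-SA 3.0
Size category: 10K<n<100K
Task category: question answering

Dataset format

The dataset is provided in ChatML format, ready to use with HuggingFace TRL's SFT Trainer.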
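
To make the format concrete, here is a minimal sketch of how a two-message conversation looks once rendered as ChatML. It assumes the common ChatML convention of `<|im_start|>` and `<|im_end|>` markers; `render_chatml` is a hypothetical helper written for this illustration, and the exact output of the tokenizer's own chat template (for example its whitespace) may differ. The message contents are placeholders, not rows from the dataset.

```python
def render_chatml(messages):
    """Render a list of {"role", "content"} dicts as ChatML-style text.

    Hypothetical helper: it mimics the usual ChatML layout by hand and is
    not the chat template shipped with any particular tokenizer.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


# Placeholder conversation in the user/assistant shape used by the dataset.
messages = [
    {
        "role": "user",
        "content": "Question:\n(placeholder question)\n\nContext:\n(placeholder context)",
    },
    {"role": "assistant", "content": "(placeholder answer)"},
]

text = render_chatml(messages)
print(text)
```

Each row of the published dataset stores one such rendered string in its `text` column.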

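Data conversion code

A Python script converts the source dataset into the target format.
The script loads the dataset, loads a tokenizer with AutoTokenizer, and applies the tokenizer's chat template to format each row as ChatML.
Each source row has three columns, instruction, input, and output, which the script uses as the question, the context, and the assistant's answer.
For each row, the script picks at random (probability 0.5 each) whether the question comes before or after the context, then builds a two-message conversation (user, then assistant).
The result, reduced to a single text column, is saved as a Parquet file.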
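
The random ordering step can be looked at in isolation. The sketch below uses only the standard library; `build_user_message` and its `rng` parameter are hypothetical names introduced here so the behavior can be reproduced with a seed, and the sample strings are placeholders rather than dataset rows.

```python
import random


def build_user_message(instruction, context, rng=random):
    """Combine a question and its context in one of two orders.

    Hypothetical helper mirroring the ordering logic described above:
    with probability 0.5 the question comes first, otherwise the context.
    """
    instruction = instruction.strip()
    context = context.strip()
    if rng.random() < 0.5:
        return f"Question:\n{instruction}\n\nContext:\n{context}"
    return f"Context:\n{context}\n\nQuestion:\n{instruction}"


# A seeded generator makes the sequence of orderings repeatable.
rng = random.Random(0)
samples = [
    build_user_message("  (placeholder question) ", " (placeholder context)", rng)
    for _ in range(200)
]

# Over many draws both layouts should appear.
layouts = set(samples)
print(len(layouts))
```

Mixing the two layouts means a model fine-tuned on the data sees the question both before and after its context, rather than a single fixed prompt shape.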
