中文高质量大模型多轮对话SFT数据集

Name: 中文高质量大模型多轮对话SFT数据集
Creator: 晴数智慧
License: 暂无描述

OpenDataLab2024-05-30 收录

下载链接：

https://opendatalab.com/MagicHub/LLM-SFT-Dataset

下载链接

链接失效反馈

官方服务：

更多采购需求

资源简介：

该数据集来源于晴数智慧LLM多领域超自然SFT多轮对话文本数据集。该数据集包含97184轮中文自然对话句子，涉及【家庭生活、教育医疗、军事战争、科学技术、气候环境、人文科学、商业经济、数码产品、体育竞技、休闲娱乐、衣食住行、艺术美术、政治法律、职业发展、宗教信仰】15个主题。领域覆盖多样，也可以单独抽取相关领域的数据进行领域SFT。本次开源的部分数据，由来自中国的644名不同ID的采集人独家贡献，北京晴数智慧科技有限公司进行授权采集。每组对话由两位采集人围绕一个主题展开，上下文对话与当前的内容逻辑相关。适用于训练大模型多轮对话 (back and forth conversation)、上下文逻辑推理能力，以及端到端对话大模型。

This dataset is derived from the Qingshu Wisdom LLM multi-domain unscripted natural multi-turn dialogue supervised fine-tuning (SFT) text dataset. It contains 97,184 rounds of natural Chinese conversational sentences, covering 15 themes: Family Life, Education and Healthcare, Military Affairs and Warfare, Science and Technology, Climate and Environment, Humanities, Business and Economics, Digital Products, Sports Competitions, Leisure and Entertainment, Daily Life (Clothing, Food, Housing and Transportation), Art and Fine Arts, Politics and Law, Career Development, and Religious Beliefs. With diverse domain coverage, the dataset also allows for the extraction of domain-specific data for domain-level supervised fine-tuning. The partially open-sourced data was exclusively contributed by 644 collectors with unique IDs from China, and the collection work was authorized by Beijing Qingshu Wisdom Technology Co., Ltd. Each dialogue session is developed by two collectors around a single theme, where the contextual dialogue is logically consistent with the current content. It is suitable for training large language models (LLMs) in multi-turn back-and-forth conversation capabilities, contextual logical reasoning abilities, as well as end-to-end conversational LLMs.

提供机构：

晴数智慧

搜集汇总

数据集介绍