Chinese High-Quality Multi-Turn Dialogue SFT Dataset for Large Language Models
Source: OpenDataLab. Indexed on 2024-05-30.
Download link:
https://opendatalab.com/MagicHub/LLM-SFT-Dataset
Resource description:
This dataset is derived from the Qingshu Wisdom (晴数智慧) LLM multi-domain, unscripted, natural multi-turn dialogue text dataset for supervised fine-tuning (SFT). It contains 97,184 turns of natural Chinese conversational sentences covering 15 topics: Family Life, Education and Healthcare, Military Affairs and Warfare, Science and Technology, Climate and Environment, Humanities, Business and Economics, Digital Products, Sports Competitions, Leisure and Entertainment, Daily Life (Clothing, Food, Housing and Transportation), Art and Fine Arts, Politics and Law, Career Development, and Religious Beliefs. Because the domain coverage is broad, data for a single domain can also be extracted on its own for domain-specific SFT. The portion released as open source was contributed exclusively by 644 collectors from China, each with a distinct ID, and collection was authorized by Beijing Qingshu Wisdom Technology Co., Ltd. (北京晴数智慧科技有限公司). Each dialogue is held by two collectors around a single topic, and each utterance is logically related to the preceding context. The dataset is suited to training large language models (LLMs) for multi-turn back-and-forth conversation and contextual logical reasoning, and to training end-to-end conversational LLMs.
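The description mentions extracting a single domain for domain-specific SFT. The sketch below shows one way that selection could look. The source does not describe the file format, so the record layout here (a list of dictionaries with hypothetical `topic` and `turns` fields) and the sample utterances are assumptions made purely for illustration, not the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical record layout: the field names "topic" and "turns" are
# assumptions, and the utterances are placeholders, not real dataset content.
SAMPLE_DIALOGUES = [
    {"topic": "Science and Technology", "turns": ["placeholder A1", "placeholder B1"]},
    {"topic": "Family Life", "turns": ["placeholder A2", "placeholder B2", "placeholder A3"]},
    {"topic": "Science and Technology", "turns": ["placeholder A4"]},
]


def filter_by_topic(dialogues, wanted_topics):
    """Return only the dialogues whose topic is in wanted_topics."""
    wanted = set(wanted_topics)
    return [d for d in dialogues if d.get("topic") in wanted]


def count_turns_by_topic(dialogues):
    """Count the total number of turns per topic."""
    counts = defaultdict(int)
    for d in dialogues:
        counts[d.get("topic")] += len(d.get("turns", []))
    return dict(counts)


if __name__ == "__main__":
    subset = filter_by_topic(SAMPLE_DIALOGUES, ["Science and Technology"])
    print(len(subset), "dialogues selected")
    print(count_turns_by_topic(SAMPLE_DIALOGUES))
```

Checking the per-topic turn counts before fine-tuning helps confirm that the selected domain has enough data. Once the real file format is known, the field names should be adjusted to match it.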
Provider:
Qingshu Wisdom (晴数智慧)
Collected summary
Dataset introduction

Background and challenges
Background overview
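This is a high-quality Chinese multi-turn dialogue SFT dataset for large language models. It contains about 97,000 turns of real, natural human conversational sentences across 15 topics, including family life and education and healthcare. The dialogues are described as emotionally rich, domain-relevant, and highly expressive, which makes the dataset suitable for training multi-turn dialogue and contextual logical reasoning in large models. The dataset is released under a non-commercial license and may be used only for non-commercial purposes.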
The above content was collected and summarized by 遇见数据集.



