Slim-LCCC-zh

Name: Slim-LCCC-zh
Creator: maas
Published: 2025-12-03 10:45:50
License: 暂无描述

魔搭社区2025-12-03 更新2024-05-15 收录

下载链接：

https://modelscope.cn/datasets/AI-ModelScope/Slim-LCCC-zh

下载链接

链接失效反馈

官方服务：

资源简介：

### Load with SDK ```bash from modelscope import MsDataset ds = MsDataset.load('AI-ModelScope/Slim-LCCC-zh', split='train') print(next(iter(ds))) ``` ### Clone with HTTP ```bash git clone https://www.modelscope.cn/datasets/AI-ModelScope/Slim-LCCC-zh.git ``` ### 简介在LLM横行的今天，大家都在讲究SFT数据质量。相比于各种一板一眼的AI回复，又是step by step又是detailed reasoning，这种非常casual的对话显得那么的独特，更适合用作情感陪伴闲聊机器人的目的。本项目提供了一个大规模中文对话数据集，原始数据来自于清华大学的LCCC(Large-scale Cleaned Chinese Conversation)数据集基于LCCC-large，但因为有1200万。故使用bert-base-chinese转换为embedding，且使用类knn的方法抽取了1万条。并转换成了sharegpt格式。从实用的角度来说，因为对话都只有两句，需要通过GPT进行续写。但是实测发现openai系列的太严肃了，失去了casual的味道。浅测了一下文心一言可以续写这种闲聊对话。只是测试了一下，并没有放在这个数据集中。当然了，最好的还是收集真实世界的对话。

### Load via SDK bash from modelscope import MsDataset ds = MsDataset.load('AI-ModelScope/Slim-LCCC-zh', split='train') print(next(iter(ds))) ### Clone via HTTP bash git clone https://www.modelscope.cn/datasets/AI-ModelScope/Slim-LCCC-zh.git ### Introduction In the era of widespread LLM applications, increasing attention has been paid to the quality of SFT datasets. Compared with rigid AI responses that follow step-by-step detailed reasoning processes, these highly casual conversations are uniquely distinctive, making them more suitable for developing emotional companion chatbots. This project provides a large-scale Chinese conversational dataset, whose original source is the LCCC (Large-scale Cleaned Chinese Conversation) dataset from Tsinghua University. The original LCCC-large dataset contains 12 million samples. To handle this large scale, we converted the samples into embeddings using bert-base-chinese, and extracted 10,000 samples via a KNN-like method, then converted the extracted samples into the ShareGPT format. From a practical perspective, each conversation in this dataset only consists of two turns, so continuation generation is required using GPT models. However, our tests show that OpenAI series models produce overly formal responses, losing the casual tone required for such chats. We briefly tested Wenxin Yiyan, which can generate suitable continuations for these casual conversations, but this test was only conducted on a small scale and the extended data was not included in this dataset. Naturally, the best approach is still to collect real-world conversational data.

提供机构：

maas

创建时间：

2024-01-22

搜集汇总

数据集介绍