Chinese Dialogue Dataset (中文对话数据集)

Name: Chinese Dialogue Dataset (中文对话数据集)
Creator: maas
Published: 2025-11-27 14:15:51
License: No description provided

Source: ModelScope community (魔搭社区). Updated 2025-11-27. Indexed 2025-01-04.

Download link:

https://modelscope.cn/datasets/nlcvcln/dialogue_zh


Resource description:

Dataset file metadata and the data files themselves are available on the "Dataset Files" page. This dataset card uses the default template, and the contributors have not provided a more detailed introduction. The dataset can be downloaded with Git clone or with the ModelScope SDK. It contains the following files:

rlhf_data_1128.jsonl: the chosen responses were rewritten to produce new chosen data, and duplicate responses were rewritten a second time.

coig_rewrite.jsonl: responses from a COIG sample, rewritten to be shorter and more colloquial.

lccc_personaldialogue_magic_luge_persona_dedup_profile_g8_addlabel.jsonl: punctuation added to dialogues that had none.

lccc_personaldialogue_magic_luge_persona_dedup_profile_g8_addlabel_rewritedup.jsonl: after punctuation was added, duplicate dialogues were rewritten to reduce repetition.

tigerbot-alpaca-zh-0.5m_hasinput_filterlen_checkanswer.jsonl: TigerBot data filtered by response length, with a model judging whether each response answers the question (spot checks found dirty data).

psy_rewrite_1128.jsonl: responses rewritten to be colloquial, then rewritten again for diversity.
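The card does not document the record schema of the `.jsonl` files, so a reasonable first step after downloading is to inspect which fields each file actually contains. The following is a minimal sketch that assumes only that every non-empty line is one JSON object. The helper names `read_jsonl` and `summarize_keys` are illustrative and are not part of the ModelScope SDK.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so dirty data is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            yield record


def summarize_keys(path: str) -> Counter:
    """Count how often each top-level key appears across all records."""
    counts: Counter = Counter()
    for record in read_jsonl(path):
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

For example, if the repository has been cloned into a local directory (the path `dialogue_zh/` here is hypothetical), `summarize_keys("dialogue_zh/coig_rewrite.jsonl")` returns a `Counter` of field names, which shows the schema before any training code depends on it.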

Provider:

maas

Created:

2024-12-31

Collected summary

Dataset introduction

Background and challenges

Background overview

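This is a Chinese dialogue dataset produced mainly by preprocessing and augmenting existing dialogue data. The processing includes rewriting duplicate responses, adding punctuation, filtering by length, and making responses more colloquial, all aimed at improving data quality. The dataset consists of several JSONL files, such as rlhf_data_1128.jsonl and coig_rewrite.jsonl, and is suited to training and tuning natural language processing models.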
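Two of the preprocessing steps named above, length filtering and finding duplicate responses, can be illustrated with a short sketch. This is not the contributors' pipeline, which is not published on the card. The field name `response` and the character thresholds are hypothetical. The source describes rewriting duplicates; this sketch only flags the responses that would be candidates for rewriting.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_by_length(
    records: Iterable[dict], field: str, min_chars: int, max_chars: int
) -> List[dict]:
    """Keep records whose `field` is a string with min_chars <= length <= max_chars."""
    kept = []
    for record in records:
        text = record.get(field)
        if isinstance(text, str) and min_chars <= len(text) <= max_chars:
            kept.append(record)
    return kept


def find_duplicates(records: Iterable[dict], field: str) -> Set[str]:
    """Return the `field` values that occur in more than one record (exact match)."""
    counts = Counter(
        record[field] for record in records if isinstance(record.get(field), str)
    )
    return {text for text, n in counts.items() if n > 1}


# Hypothetical records; the real files may use different field names.
sample = [
    {"response": "好的"},
    {"response": "好的"},
    {"response": "今天天气不错，适合出去走走。"},
    {"response": ""},
]
short_enough = filter_by_length(sample, "response", min_chars=1, max_chars=5)
repeated = find_duplicates(sample, "response")
```

Exact string matching is the simplest choice. Near-duplicate detection would need normalization or a similarity measure, which the card does not describe.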

The content above was collected and summarized by 遇见数据集.