silver/lccc

Name: silver/lccc
Creator: silver
Published: 2022-11-06 04:51:16
License: No description available

Source: Hugging Face | Updated 2022-11-06 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/silver/lccc


Resource description:

---
annotations_creators:
- other
language_creators:
- other
language:
- zh
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- conversational
task_ids:
- dialogue-generation
pretty_name: lccc
tags:
- dialogue-response-retrieval
---

# Dataset Card for lccc_large

## Table of Contents
- [Dataset Card for lccc_large](#dataset-card-for-lccc_large)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://github.com/thu-coai/CDial-GPT
- **Repository:** https://github.com/thu-coai/CDial-GPT
- **Paper:** https://arxiv.org/abs/2008.03946

### Dataset Summary

lccc: The Large-scale Cleaned Chinese Conversation corpus (LCCC) is a large Chinese dialogue corpus originating from Chinese social media. A rigorous data cleaning pipeline is designed to ensure the quality of the corpus. This pipeline involves a set of hand-crafted rules and several classifier-based filters built with machine learning algorithms. Noise such as offensive or sensitive words, special symbols, emojis, grammatically incorrect sentences, and incoherent conversations is filtered out.

### Supported Tasks and Leaderboards

- dialogue-generation: The dataset can be used to train a model for generating dialogue responses.
- response-retrieval: The dataset can be used to train a reranker model that can be used to implement a retrieval-based dialogue model.

### Languages

LCCC is in Chinese.

## Dataset Structure

### Data Instances

["火锅我在重庆成都吃了七八顿火锅", "哈哈哈哈！那我的嘴巴可能要烂掉！", "不会的就是好油腻"]

### Data Fields

Each line is a list of the utterances that make up one dialogue. Note that the LCCC dataset provided on our original GitHub page is in JSON format, whereas LCCC is provided here in JSONL format.

### Data Splits

We do not provide an official split for LCCC-large, but we provide a split for LCCC-base:

|train|valid|test|
|:---:|:---:|:---:|
|6,820,506 | 20,000 | 10,000|

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

Please cite the following paper if you find this dataset useful:

```bibtex
@inproceedings{wang2020chinese,
  title={A Large-Scale Chinese Short-Text Conversation Dataset},
  author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle={NLPCC},
  year={2020},
  url={https://arxiv.org/abs/2008.03946}
}
```


Provider:

silver

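Summary of original information

Dataset overview

Dataset name: lccc_large

Dataset alias: lccc

Language: Chinese (zh)

License: MIT

Multilinguality: monolingual

Dataset size: 10M<n<100M

Source datasets: original

Task category: conversational

Task ID: dialogue-generation

Tags: dialogue-response-retrieval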

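Dataset description

Dataset summary: lccc (Large-scale Cleaned Chinese Conversation corpus) is a large-scale Chinese dialogue dataset sourced from Chinese social media. Its quality is ensured by a rigorous data cleaning pipeline that combines a set of hand-crafted rules with classifiers based on machine learning algorithms. The pipeline filters out noise such as profanity, special characters, emoticons, grammatically incorrect sentences, and incoherent conversations.
Supported tasks and leaderboards:
- Dialogue generation: training a model to generate dialogue responses.
- Response retrieval: training a reranker model that can be used to implement a retrieval-based dialogue model.
Language: The dialogues in LCCC are in Chinese.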

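Dataset structure

Data instances: An example dialogue: ["火锅我在重庆成都吃了七八顿火锅", "哈哈哈哈！那我的嘴巴可能要烂掉！", "不会的就是好油腻"]
Data fields: Each line is a list of the utterances that make up one dialogue. The LCCC dataset on the original GitHub page is in JSON format; the version provided here is in JSONL format.
Data splits: No official split is provided for LCCC-large, but a split is provided for LCCC-base:

| train | valid | test |
|:---:|:---:|:---:|
| 6,820,506 | 20,000 | 10,000 |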
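The JSONL layout described above (one JSON list of utterances per line) can be read with the Python standard library alone. The following is a minimal sketch, not the loader shipped with the dataset: the function name `read_dialogues` and the in-memory sample are illustrative assumptions, and a real file would be opened with `open(path, encoding="utf-8")` instead of `io.StringIO`.

```python
import io
import json


def read_dialogues(lines):
    """Yield each non-empty JSONL line as a list of utterance strings.

    `read_dialogues` is a hypothetical helper, not part of the LCCC release.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        dialogue = json.loads(line)
        if not isinstance(dialogue, list) or not all(
            isinstance(utterance, str) for utterance in dialogue
        ):
            raise ValueError(f"line {line_number}: expected a list of strings")
        yield dialogue


# In-memory stand-in for a JSONL file, using the sample from the dataset card.
sample = io.StringIO(
    '["火锅我在重庆成都吃了七八顿火锅", "哈哈哈哈！那我的嘴巴可能要烂掉！", "不会的就是好油腻"]\n'
)
dialogues = list(read_dialogues(sample))

# LCCC-base split sizes as listed in the table above.
LCCC_BASE_SPLITS = {"train": 6_820_506, "valid": 20_000, "test": 10_000}
total_dialogues = sum(LCCC_BASE_SPLITS.values())  # 6,850,506
```

Reading line by line keeps memory use flat, which matters for a corpus in the 10M to 100M range; only the current dialogue is held in memory at any time.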

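Dataset creation

Data collection and normalization, source language producers, annotation process, annotators, personal and sensitive information, social impact of the dataset, discussion of biases, other known limitations, dataset curators, licensing information: These sections are missing and need more information.
Citation information: If you find this dataset useful, please cite the following paper (BibTeX):

@inproceedings{wang2020chinese,
  title={A Large-Scale Chinese Short-Text Conversation Dataset},
  author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle={NLPCC},
  year={2020},
  url={https://arxiv.org/abs/2008.03946}
}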

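Collected summary

Dataset introduction

Construction

LCCC is built from large-scale dialogue data collected from Chinese social media, with a rigorous data cleaning pipeline designed to ensure data quality. The pipeline combines a set of hand-crafted rules with classifiers based on machine learning algorithms, and filters out noise such as profanity, special characters, emoticons, grammatical errors, and incoherent conversations.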
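To make the rule-based half of such a pipeline concrete, here is a minimal sketch of a dialogue-level filter. Everything in it is an assumption for illustration: the emoji ranges, the placeholder blocklist, the thresholds, and the function name `keep_dialogue` are hypothetical and are not the rules used to build LCCC. The classifier-based filters are not sketched.

```python
import re

# Hypothetical rules for illustration only; NOT the rules used for LCCC.
# Matches common emoji and pictograph code points.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Placeholder blocklist; a real pipeline would use a curated word list.
BLOCKLIST = {"badword"}


def keep_dialogue(dialogue, min_turns=2, max_chars=100):
    """Return True if every utterance passes the illustrative rules."""
    if len(dialogue) < min_turns:
        return False  # a single utterance is not a conversation
    for utterance in dialogue:
        text = utterance.strip()
        if not text or len(text) > max_chars:
            return False  # empty or overly long utterance
        if EMOJI_RE.search(text):
            return False  # contains an emoji or pictograph
        if any(word in text for word in BLOCKLIST):
            return False  # contains a blocklisted word
    return True


candidates = [
    ["你好", "你好呀"],
    ["你好"],
    ["你好😀", "嗯"],
]
cleaned = [d for d in candidates if keep_dialogue(d)]
```

This sketch drops a whole dialogue as soon as one utterance fails a rule. Whether the actual pipeline drops or truncates such dialogues is not stated in this description.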

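Features

The main feature of LCCC is its high-quality Chinese dialogue data, which has passed through multiple stages of cleaning and filtering aimed at keeping conversations coherent and the language well-formed. The dataset is also large (between 10 million and 100 million dialogues) and covers a wide range of conversational scenarios, making it suitable for a variety of dialogue generation tasks.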

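Usage

LCCC can be used to train dialogue generation models and response retrieval models. For dialogue generation, the dataset provides dialogue contexts paired with responses, which helps a model learn to generate natural, fluent replies. For response retrieval, the dataset can be used to train a reranker model that implements a retrieval-based dialogue system.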
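Both uses start from the same step: turning each list of utterances into (context, response) pairs. The sketch below shows one common way to do this, plus random negative sampling to produce labelled examples for a reranker. The function names and the one-negative-per-positive scheme are assumptions for illustration, not the training recipe of the LCCC authors.

```python
import random


def context_response_pairs(dialogue):
    """Treat every turn after the first as a response to the turns before it."""
    return [(dialogue[:i], dialogue[i]) for i in range(1, len(dialogue))]


def reranker_examples(dialogues, rng):
    """Build (context, candidate, label) triples.

    Label 1 marks the true response; label 0 marks an utterance sampled at
    random from a different dialogue. This sampling scheme is an assumption.
    """
    examples = []
    for index, dialogue in enumerate(dialogues):
        # Non-empty dialogues other than the current one.
        others = [d for j, d in enumerate(dialogues) if j != index and d]
        for context, response in context_response_pairs(dialogue):
            examples.append((context, response, 1))
            if others:
                negative = rng.choice(rng.choice(others))
                examples.append((context, negative, 0))
    return examples


toy_dialogues = [["a", "b", "c"], ["x", "y"]]
pairs = context_response_pairs(toy_dialogues[0])
examples = reranker_examples(toy_dialogues, random.Random(0))
```

A randomly sampled negative can occasionally be a plausible reply, so real setups often sample harder negatives or several per context; this sketch keeps one negative per positive for clarity.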

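Background and challenges

Background

LCCC (Large-scale Cleaned Chinese Conversation corpus) is a Chinese dialogue dataset developed by the CoAI lab at Tsinghua University to provide high-quality corpus support for Chinese dialogue generation and response retrieval. The dataset is sourced from Chinese social media and passes through a rigorous cleaning pipeline that filters out noise such as sensitive words, special symbols, and grammatical errors. LCCC was released in 2020; its main researchers include Yida Wang and Pei Ke. The core research question is how to provide a large-scale, high-quality dialogue corpus for Chinese dialogue generation. The release of this dataset is significant for Chinese natural language processing, and for dialogue systems in particular.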

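Current challenges

The main challenges in building LCCC are as follows. First, selecting high-quality dialogue content from massive amounts of Chinese social media data requires carefully designed filtering rules and classifiers. Second, the dataset must remain diverse and representative, so that over-cleaning does not introduce bias. Third, personal and sensitive information in the data must be handled so that use of the data is ethical and lawful. In application, LCCC also faces the difficulty of training dialogue generation models and the problem of bias in response retrieval; these issues remain open for future research.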

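Common scenarios

Typical use cases

LCCC is widely used in Chinese dialogue generation, and its typical use case is training dialogue generation models. With this dataset, researchers can build models that generate natural, fluent, and contextually relevant responses. LCCC can also be used for retrieval-based dialogue models: training a reranker on it can improve a dialogue system's performance on response retrieval.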

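Derived and related work

The release of LCCC has prompted a series of related studies, particularly in Chinese dialogue generation and retrieval. Researchers have proposed a variety of dialogue generation models and retrieval algorithms based on the dataset, advancing Chinese dialogue system technology. LCCC has also served as a reference for building dialogue datasets in other languages, supporting research on multilingual dialogue systems.

Recent research on the dataset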