---
annotations_creators:
- other
language_creators:
- other
language:
- zh
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- conversational
task_ids:
- dialogue-generation
pretty_name: lccc
tags:
- dialogue-response-retrieval
---
# Dataset Card for lccc_large
## Table of Contents
- [Dataset Card for lccc_large](#dataset-card-for-lccc_large)
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
- [Who are the source language producers?](#who-are-the-source-language-producers)
- [Annotations](#annotations)
- [Annotation process](#annotation-process)
- [Who are the annotators?](#who-are-the-annotators)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://github.com/thu-coai/CDial-GPT
- **Repository:** https://github.com/thu-coai/CDial-GPT
- **Paper:** https://arxiv.org/abs/2008.03946
### Dataset Summary
lccc: Large-scale Cleaned Chinese Conversation corpus (LCCC) is a large Chinese dialogue corpus originate from Chinese social medias. A rigorous data cleaning pipeline is designed to ensure the quality of the corpus. This pipeline involves a set of rules and several classifier-based filters. Noises such as offensive or sensitive words, special symbols, emojis, grammatically incorrect sentences, and incoherent conversations are filtered.
lccc是一套来自于中文社交媒体的对话数据,我们设计了一套严格的数据过滤流程来确保该数据集中对话数据的质量。 这一数据过滤流程中包括一系列手工规则以及若干基于机器学习算法所构建的分类器。 我们所过滤掉的噪声包括:脏字脏词、特殊字符、颜表情、语法不通的语句、上下文不相关的对话等。
### Supported Tasks and Leaderboards
- dialogue-generation: The dataset can be used to train a model for generating dialogue responses.
- response-retrieval: The dataset can be used to train a reranker model that can be used to implement a retrieval-based dialogue model.
### Languages
LCCC is in Chinese
LCCC中的对话是中文的
## Dataset Structure
### Data Instances
["火锅 我 在 重庆 成都 吃 了 七八 顿 火锅", "哈哈哈哈 ! 那 我 的 嘴巴 可能 要 烂掉 !", "不会 的 就是 好 油腻"]
### Data Fields
Each line is a list of utterances that consist a dialogue.
Note that the LCCC dataset provided in our original Github page is in json format,
however, we are providing LCCC in jsonl format here.
### Data Splits
We do not provide the offical split for LCCC-large.
But we provide a split for LCCC-base:
|train|valid|test|
|:---:|:---:|:---:|
|6,820,506 | 20,000 | 10,000|
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
[Needs More Information]
### Citation Information
Please cite the following paper if you find this dataset useful:
```bibtex
@inproceedings{wang2020chinese,
title={A Large-Scale Chinese Short-Text Conversation Dataset},
author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
booktitle={NLPCC},
year={2020},
url={https://arxiv.org/abs/2008.03946}
}
```
annotations_creators:
- 其他
language_creators:
- 其他
language:
- 中文(zh)
license:
- MIT
multilinguality:
- 单语言
size_categories:
- 1000万<n<1亿
source_datasets:
- 原创
task_categories:
- 对话式
task_ids:
- 对话生成
pretty_name: lccc
tags:
- 对话响应检索(dialogue-response-retrieval)
# lccc_large 数据集卡片
## 目录
- [lccc_large 数据集卡片](#lccc_large-数据集卡片)
- [目录](#目录)
- [数据集描述](#数据集描述)
- [数据集概览](#数据集概览)
- [支持任务与评测基准](#支持任务与评测基准)
- [语言](#语言)
- [数据集结构](#数据集结构)
- [数据样例](#数据样例)
- [数据字段](#数据字段)
- [数据划分](#数据划分)
- [数据集构建](#数据集构建)
- [数据遴选依据](#数据遴选依据)
- [源数据](#源数据)
- [初始数据收集与标准化](#初始数据收集与标准化)
- [源语言生产者是谁?](#源语言生产者是谁?)
- [注释](#注释)
- [注释流程](#注释流程)
- [注释者是谁?](#注释者是谁?)
- [个人与敏感信息](#个人与敏感信息)
- [数据集使用注意事项](#数据集使用注意事项)
- [数据集的社会影响](#数据集的社会影响)
- [偏差讨论](#偏差讨论)
- [其他已知局限性](#其他已知局限性)
- [附加信息](#附加信息)
- [数据集策展人](#数据集策展人)
- [许可证信息](#许可证信息)
- [引用信息](#引用信息)
## 数据集描述
- **主页**:https://github.com/thu-coai/CDial-GPT
- **代码仓库**:https://github.com/thu-coai/CDial-GPT
- **论文**:https://arxiv.org/abs/2008.03946
### 数据集概览
LCCC(Large-scale Cleaned Chinese Conversation Corpus,大规模中文清洗对话语料库)是一套源自中文社交媒体的大型对话语料库。为保障语料库质量,我们设计了严格的数据清洗流程,该流程包含一系列规则与若干基于分类器的过滤模块。我们会过滤掉冒犯性/敏感词汇、特殊符号、表情符号、语法错误语句以及上下文不连贯的对话等噪声数据。
### 支持任务与评测基准
- 对话生成(dialogue-generation):该数据集可用于训练对话响应生成模型。
- 响应检索(response-retrieval):该数据集可用于训练重排序模型,以实现基于检索的对话系统。
### 语言
该数据集的语言为中文。
## 数据集结构
### 数据样例
["我在重庆、成都吃了七八顿火锅", "哈哈哈哈!那我的嘴巴可能要烂掉了!", "不会的,就是太油腻了"]
### 数据字段
每一行均为构成一则对话的多轮话语(utterance)列表。需注意,官方GitHub仓库中发布的原始LCCC数据集采用JSON格式存储,而本版本采用JSONL格式。
### 数据划分
本项目未提供LCCC-large的官方划分方案,但提供了LCCC-base的划分如下:
| 训练集 | 验证集 | 测试集 |
|:----:|:----:|:----:|
| 6,820,506 | 20,000 | 10,000 |
## 数据集构建
### 数据遴选依据
[需补充更多信息]
### 源数据
#### 初始数据收集与标准化
[需补充更多信息]
#### 源语言生产者是谁?
[需补充更多信息]
### 注释
#### 注释流程
[需补充更多信息]
#### 注释者是谁?
[需补充更多信息]
### 个人与敏感信息
[需补充更多信息]
## 数据集使用注意事项
### 数据集的社会影响
[需补充更多信息]
### 偏差讨论
[需补充更多信息]
### 其他已知局限性
[需补充更多信息]
## 附加信息
### 数据集策展人
[需补充更多信息]
### 许可证信息
[需补充更多信息]
### 引用信息
若您使用本数据集,请引用以下论文:
bibtex
@inproceedings{wang2020chinese,
title={"A Large-Scale Chinese Short-Text Conversation Dataset"},
author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
booktitle={NLPCC},
year={2020},
url={https://arxiv.org/abs/2008.03946}
}