LCCC
ModelScope community | Updated 2026-05-16 | Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/OmniData/LCCC
Resource description:
displayName: LCCC (Large-scale Cleaned Chinese Conversation corpus)
labelTypes:
- Chinese Corpus
license:
- LCCC Custom
paperUrl: https://arxiv.org/pdf/2008.03946v2.pdf
publishDate: "2020"
publishUrl: https://github.com/thu-coai/CDial-GPT
publisher:
- Tsinghua University
- Samsung Research China
tags:
- Sensitive Words
- Special Symbols
taskTypes:
- Dialogue Generation
- Short Text Conversation
---
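# Dataset Introduction
## Overview
We present a large-scale cleaned Chinese conversation corpus (LCCC), which contains LCCC-base and LCCC-large. To ensure the quality of the corpus, a rigorous data cleaning pipeline was designed. The pipeline combines a set of rules with several classifier-based filters. Noise such as offensive or sensitive words, special symbols, emojis, ungrammatical sentences, and incoherent conversations is filtered out.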
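To make the rule-based part of such a pipeline concrete, here is a minimal Python sketch. It is not the authors' implementation: the blocklist, the emoji ranges, the repeated-symbol rule, and the function names `is_clean_utterance` and `filter_dialogues` are all illustrative assumptions, and the classifier-based filters (for grammaticality and coherence) are not shown.

```python
import re
from typing import List

# Hypothetical blocklist; the actual word lists used for LCCC are not given here.
BLOCKED_WORDS = {"badword", "spamlink"}

# Emoji and pictographs in a few common Unicode ranges (assumed, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Three or more repeats of the same non-word, non-space character, e.g. "!!!!".
SYMBOL_RUN_RE = re.compile(r"([^\w\s])\1{2,}")


def is_clean_utterance(text: str) -> bool:
    """Return True if the utterance passes every illustrative rule."""
    if not text.strip():
        return False
    if any(word in text for word in BLOCKED_WORDS):
        return False
    if EMOJI_RE.search(text):
        return False
    if SYMBOL_RUN_RE.search(text):
        return False
    return True


def filter_dialogues(dialogues: List[List[str]]) -> List[List[str]]:
    """Keep dialogues with at least two turns in which every turn is clean."""
    return [
        d for d in dialogues
        if len(d) >= 2 and all(is_clean_utterance(u) for u in d)
    ]
```

A dialogue is dropped as a whole when any turn fails a rule, since removing a single turn would leave the remaining turns without context.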
## Citation
```
@inproceedings{wang2020large,
title={A large-scale chinese short-text conversation dataset},
author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
booktitle={CCF International Conference on Natural Language Processing and Chinese Computing},
pages={91--103},
year={2020},
organization={Springer}
}
```
## Download dataset
:modelscope-code[]{type="git"}
Provider:
maas
Created:
2024-07-08