LCCC (Large-scale Cleaned Chinese Conversation corpus)
OpenDataLab · Updated 2026-05-24 · Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/LCCC
Description:
We propose the Large-scale Cleaned Chinese Conversation corpus (LCCC), which comprises LCCC-base and LCCC-large. To ensure the quality of the corpus, we designed a strict data-cleaning pipeline that combines a set of rules with several classifier-based filters. The pipeline filters out noise such as offensive or sensitive words, special symbols, emojis, grammatically incorrect sentences, and incoherent dialogues.
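The two-stage idea described above (cheap rules first, then a classifier-based filter) can be sketched roughly as follows. This is only an illustration and not the LCCC authors' actual pipeline: the blocklist, the emoji ranges, the function names, and the threshold are all hypothetical, and `toy_coherence_score` is a trivial stand-in for a trained classifier.

```python
import re
from typing import Callable, List

Dialogue = List[str]  # one dialogue = an ordered list of utterances

# Hypothetical blocklist; a real pipeline would use curated word lists.
BLOCKLIST = {"badword"}

# Illustrative (not exhaustive) emoji and pictograph code-point ranges.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def passes_rules(dialogue: Dialogue) -> bool:
    """Rule stage: reject dialogues with obvious surface-level noise."""
    if len(dialogue) < 2:  # a conversation needs at least two turns
        return False
    for utterance in dialogue:
        if not utterance.strip():  # empty or whitespace-only turn
            return False
        if EMOJI_RE.search(utterance):  # contains an emoji
            return False
        if any(word in utterance for word in BLOCKLIST):  # blocked term
            return False
    return True


def toy_coherence_score(dialogue: Dialogue) -> float:
    """Stand-in for a trained coherence classifier (assumption).

    Returns 1.0 when every pair of adjacent turns shares a character,
    otherwise 0.0. A real filter would use a learned model instead.
    """
    for prev, curr in zip(dialogue, dialogue[1:]):
        if not set(prev) & set(curr):
            return 0.0
    return 1.0


def clean_corpus(
    dialogues: List[Dialogue],
    score: Callable[[Dialogue], float],
    threshold: float = 0.5,
) -> List[Dialogue]:
    """Keep dialogues that pass the rules and score at or above threshold."""
    return [d for d in dialogues if passes_rules(d) and score(d) >= threshold]


if __name__ == "__main__":
    sample = [
        ["你好", "你好呀"],          # clean
        ["今天天气不错😀", "是的"],  # emoji
        ["只有一句"],                # single turn
        ["苹果", "火车"],            # rejected by the toy scorer
    ]
    print(clean_corpus(sample, toy_coherence_score))
```

Running the rules before the scorer keeps the more expensive model calls limited to dialogues that have already passed the cheap checks.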
Provider:
OpenDataLab
Created:
2022-06-07