LSICC
Source: arXiv | Updated: 2018-11-26 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/1811.10167v1
Description:
LSICC is a large-scale informal Chinese corpus created jointly by the School of Data Science at Fudan University and GSQ Tec. It contains approximately 37 million book reviews and 50,000 netizen comments on news articles. The data comes mainly from the DouBan Dushu and Chiphell websites and covers informal Chinese ranging from colloquial speech to internet slang. During construction, the researchers preprocessed the raw data by converting it to Simplified Chinese, removing short comments, and tagging special characters. LSICC is suitable for training deep learning models, particularly for sentiment analysis and Chinese word segmentation, and helps narrow the gap between existing Chinese corpora and the informal Chinese found in real-world applications.
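The three preprocessing steps named above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the description does not give the length cutoff, the tag format, or the conversion tool, so `MIN_CHARS`, the `<URL>` and `<SPECIAL>` tokens, and the tiny `T2S_TABLE` mapping (a stand-in for a full Traditional-to-Simplified converter) are all hypothetical.

```python
import re

# Hypothetical stand-in for a real Traditional-to-Simplified converter;
# only a handful of characters are mapped, purely for illustration.
T2S_TABLE = str.maketrans({"書": "书", "評": "评", "這": "这", "還": "还", "錯": "错"})

# Assumed cutoff: the source does not say how short is "too short".
MIN_CHARS = 5

# ASCII-only URL pattern, so a URL match stops before adjacent Chinese text.
URL_RE = re.compile(r"https?://[\w./?=&%#-]+", re.ASCII)

# Runs of characters that are not CJK ideographs, ASCII letters or digits,
# whitespace, common punctuation, or the angle brackets used by the tags.
SPECIAL_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s,。!?、,.!?<>]+")


def preprocess(comments):
    """Convert to Simplified Chinese, drop short comments, tag special characters."""
    cleaned = []
    for text in comments:
        # Step 1: Simplified Chinese conversion (toy mapping, see above).
        text = text.strip().translate(T2S_TABLE)
        # Step 2: remove comments shorter than the assumed cutoff.
        if len(text) < MIN_CHARS:
            continue
        # Step 3: replace URLs and other special characters with placeholder tags.
        text = URL_RE.sub("<URL>", text)
        text = SPECIAL_RE.sub("<SPECIAL>", text)
        cleaned.append(text)
    return cleaned
```

For example, `preprocess(["好書", "詳見 http://example.com/a 很好看"])` would drop the first comment as too short and replace the URL in the second with `<URL>`. A real pipeline would swap the toy table for a complete conversion dictionary and choose the cutoff and tag set to match the released corpus.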
Provided by:
School of Data Science, Fudan University
Created:
2018-11-26
Collected Summary
Dataset Introduction

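Background and Challenges
Background Overview
LSICC is a large-scale informal Chinese corpus created jointly by the School of Data Science at Fudan University and GSQ Tec. It contains approximately 37 million book reviews and 50,000 netizen news comments drawn from the DouBan Dushu and Chiphell websites, covering colloquial speech, internet slang, and other informal expressions. After preprocessing, the dataset is suitable for deep learning training, is particularly useful for sentiment analysis and Chinese word segmentation, and helps close the gap between existing Chinese corpora and informal Chinese as it is actually written.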
The content above was collected and summarized by 遇见数据集.



