THUCNews Chinese News Text Classification Dataset
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2025-05-07
Download link:
https://figshare.com/articles/dataset/THUCNews_Chinese_News_Text_Classification_Dataset/28279964/2
Resource description:
THUCTC (THU Chinese Text Classification) is a Chinese text classification toolkit developed by the Natural Language Processing Laboratory of Tsinghua University. It automates training, evaluation, and classification on user-defined text classification corpora. Text classification typically involves three steps: feature selection, feature dimensionality reduction, and model training. Selecting appropriate text features and reducing their dimensionality are challenging problems in Chinese text classification. Based on years of research experience in this area, our team chose character bigrams (two-character strings) as the feature unit in THUCTC, Chi-square as the dimensionality reduction method, tf-idf as the weighting method, and LibSVM or LibLinear as the classification model. THUCTC generalizes well to open-domain long texts, does not depend on the performance of any Chinese word segmentation tool, and offers high accuracy and fast testing speed.
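The feature steps described above (character bigrams, Chi-square selection, tf-idf weighting) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not THUCTC's actual code or API: the function names, the toy corpus, the document-level 2x2 Chi-square, and the `log(N/df) + 1` idf variant are all assumptions made for the example. The final step, training LibSVM or LibLinear on the resulting vectors, is omitted.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return overlapping two-character strings; no word segmenter is needed."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def chi_square_scores(docs, labels):
    """Score each bigram by its largest per-class Chi-square statistic,
    using document-level presence/absence (a 2x2 contingency table)."""
    n = len(docs)
    doc_sets = [set(char_bigrams(d)) for d in docs]
    class_sizes = Counter(labels)
    df = Counter(f for s in doc_sets for f in s)
    df_in_class = Counter((f, y) for s, y in zip(doc_sets, labels) for f in s)
    scores = {}
    for f, n_f in df.items():
        best = 0.0
        for y, n_y in class_sizes.items():
            a = df_in_class[(f, y)]  # docs in class y that contain f
            b = n_f - a              # docs in other classes that contain f
            c = n_y - a              # docs in class y that lack f
            d = n - n_f - c          # docs in other classes that lack f
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[f] = best
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring bigrams (ties broken by the bigram string)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


def tfidf_vectors(docs, vocabulary):
    """Weight the selected bigrams as tf * (log(N / df) + 1); assumed idf variant."""
    n = len(docs)
    counts = [Counter(b for b in char_bigrams(d) if b in vocabulary) for d in docs]
    df = Counter(f for c in counts for f in c)
    idf = {f: math.log(n / df[f]) + 1.0 for f in df}
    return [{f: tf * idf[f] for f, tf in c.items()} for c in counts]


# Hypothetical toy corpus: two sports headlines, two finance headlines.
docs = ["球队赢得比赛冠军", "球员在比赛中进球", "股票市场今日上涨", "银行下调贷款利率"]
labels = ["sports", "sports", "finance", "finance"]

scores = chi_square_scores(docs, labels)
selected = select_top_k(scores, k=3)
vectors = tfidf_vectors(docs, set(selected))
print(selected)
print(vectors)
```

The sparse dictionaries in `vectors` are the kind of input a linear classifier would consume. Selecting features before weighting follows the order given in the description above.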
Provider:
figshare
Created:
2025-01-25
Collected summary
Dataset introduction

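Background and challenges
Background overview
THUCNews is a Chinese news text classification dataset developed by the Natural Language Processing Laboratory of Tsinghua University. The accompanying THUCTC toolkit uses bigrams as the feature unit and Chi-square for dimensionality reduction; it is suited to classifying open-domain long texts and offers high accuracy and fast testing.
The above content was collected and summarized by 遇见数据集.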



