THUCNews Chinese News Text Classification Dataset

Name: THUCNews Chinese News Text Classification Dataset
Creator: figshare
Published: 2025-06-01 03:43:40
License: No description available

DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-05-07

Download link:

https://figshare.com/articles/dataset/THUCNews_Chinese_News_Text_Classification_Dataset/28279964/2

Resource description:

THUCTC (THU Chinese Text Classification) is a Chinese text classification toolkit developed by the Natural Language Processing Laboratory at Tsinghua University. It automates training, evaluation, and classification on user-defined text classification corpora. Text classification typically involves three steps: feature selection, feature dimensionality reduction, and model training. Choosing suitable text features and reducing their dimensionality are difficult problems in Chinese text classification. Drawing on years of research experience in this area, our team uses character bigrams (two-character strings) as the feature unit in THUCTC, the chi-square statistic for dimensionality reduction, tf-idf for feature weighting, and LibSVM or LibLinear as the classification model. THUCTC generalizes well to open-domain long texts, does not depend on the performance of any Chinese word segmentation tool, and offers high accuracy and fast testing speed.
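
The feature steps described above (character bigrams, chi-square selection, tf-idf weighting) can be illustrated with a minimal pure-Python sketch. This is not THUCTC's own code: the function names, the four toy documents, the choice of the maximum per-class chi-square score, and the plain tf × log(N/df) weighting are all assumptions made for illustration. The resulting vectors would then be passed to a linear classifier such as LibLinear, which is not shown here.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the overlapping two-character strings of text, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def chi_square_scores(docs, labels):
    """Score each bigram by its largest chi-square statistic over all classes.

    For every (bigram, class) pair a 2x2 contingency table of document
    counts is built: bigram present/absent versus in-class/out-of-class.
    """
    n = len(docs)
    doc_sets = [set(char_bigrams(doc)) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls in set(labels):
            a = b = c = d = 0
            for bigrams, label in zip(doc_sets, labels):
                present, in_class = term in bigrams, label == cls
                if present and in_class:
                    a += 1
                elif present:
                    b += 1
                elif in_class:
                    c += 1
                else:
                    d += 1
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_features(docs, labels, k):
    """Keep the k highest-scoring bigrams (ties broken by the bigram itself)."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def tfidf_vectors(docs, features):
    """Weight each selected bigram by raw term frequency times log(N / df)."""
    n = len(docs)
    counts = [Counter(char_bigrams(doc)) for doc in docs]
    idf = {}
    for f in features:
        df = sum(1 for c in counts if c[f] > 0)
        idf[f] = math.log(n / df) if df else 0.0
    return [[c[f] * idf[f] for f in features] for c in counts]


# Toy corpus invented for this sketch; it is not taken from THUCNews.
docs = ["球队赢得比赛", "球队输掉比赛", "股票市场上涨", "股票市场下跌"]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, k=5)
vectors = tfidf_vectors(docs, features)
```

Because the features are raw character pairs, no word segmenter is involved at any point, which matches the description's claim that the approach is independent of segmentation quality.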

Provider:

figshare

Created:

2025-01-25

Summary

Dataset Introduction