THUCNews

Name: THUCNews
Creator: 帕依提提
License: not specified

Indexed by 帕依提提 on 2024-03-04

Download link:

https://www.payititi.com/opendatasets/show-251.html

Resource description:

THUCTC (THU Chinese Text Classification) is a Chinese text classification toolkit released by the Natural Language Processing Laboratory of Tsinghua University. It automates training, evaluation, and classification on user-defined text classification corpora. Text classification usually involves three steps: feature selection, feature dimensionality reduction, and classification model learning. Selecting appropriate text features and reducing their dimensionality is a challenging problem in Chinese text classification. Drawing on years of research experience in Chinese text classification, our group chose character bigrams (two-character strings) as the feature unit in THUCTC, Chi-square as the feature reduction method, tf-idf as the weighting method, and LibSVM or LibLinear as the classification model. THUCTC generalizes well to long open-domain texts, does not depend on the performance of any Chinese word segmentation tool, and offers high accuracy and fast testing speed.

THUCNews was built by filtering historical data from the Sina News RSS subscription channels from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text format. Starting from the original Sina news classification system, we reorganized the data into 14 candidate categories: finance, lottery, real estate, stocks, home furnishing, education, technology, society, fashion, current affairs, sports, horoscope, games, and entertainment. Evaluated with the THUCTC toolkit on this dataset, classification accuracy reaches 88.6%.

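The feature pipeline described above (character bigrams, Chi-square selection, tf-idf weighting) can be illustrated with a minimal Python sketch. This is not THUCTC's own code: the function names, the toy documents, and the plain `log(N / df)` form of idf are assumptions made for illustration, and the final classifier step (LibSVM or LibLinear in THUCTC) is left out.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Return overlapping two-character strings, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

def chi_square_scores(docs, labels):
    """Score each bigram by its largest Chi-square statistic over all classes."""
    n = len(docs)
    class_counts = Counter(labels)
    # df[term][label] = number of documents with that label containing the term
    df = {}
    for doc, label in zip(docs, labels):
        for term in set(char_bigrams(doc)):
            df.setdefault(term, Counter())[label] += 1
    scores = {}
    for term, per_class in df.items():
        total = sum(per_class.values())
        best = 0.0
        for label, n_c in class_counts.items():
            a = per_class[label]   # in class, contains term
            b = total - a          # other classes, contains term
            c = n_c - a            # in class, lacks term
            d = n - n_c - b        # other classes, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def select_features(scores, k):
    """Keep the k highest-scoring bigrams (ties broken by the term itself)."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

def tfidf_vectors(docs, vocabulary):
    """Build sparse tf-idf vectors (dicts) restricted to the selected bigrams."""
    n = len(docs)
    counts = [Counter(t for t in char_bigrams(doc) if t in vocabulary) for doc in docs]
    doc_freq = Counter(t for c in counts for t in c)
    return [{t: tf * math.log(n / doc_freq[t]) for t, tf in c.items()} for c in counts]

# Toy corpus (hypothetical, not taken from THUCNews).
docs = ["股票市场今日上涨", "股票基金表现强劲", "球队赢得比赛冠军", "球队比赛表现出色"]
labels = ["stocks", "stocks", "sports", "sports"]

scores = chi_square_scores(docs, labels)
vocabulary = select_features(scores, 3)
vectors = tfidf_vectors(docs, vocabulary)
print(sorted(vocabulary))
print(vectors[0])
```

In a full pipeline the resulting sparse vectors would be passed to a linear classifier. Because the features are raw character bigrams, no word segmentation step is needed, which matches the toolkit's stated independence from segmentation tools.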

Provider:

帕依提提

Dataset introduction

Background overview

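THUCNews is a Chinese text classification dataset containing 740,000 Sina News documents across 14 categories, in UTF-8 plain text format. Evaluated with the THUCTC toolkit, classification accuracy reaches 88.6%. The dataset is suited to open-domain long-text classification tasks.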

The summary above was collected and generated by 遇见数据集.
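As a quick sanity check after downloading, the per-category document counts can be tallied with a short Python sketch. This listing does not describe the archive layout, so the sketch assumes one subdirectory per category with one UTF-8 `.txt` file per document; that layout, the `THUCNews` root path, and the helper names `count_documents` and `read_document` are all assumptions.

```python
from collections import Counter
from pathlib import Path

def count_documents(root):
    """Count .txt files in each immediate subdirectory of root.

    Assumes a <root>/<category>/<document>.txt layout (not confirmed
    by this listing).
    """
    counts = Counter()
    for path in Path(root).glob("*/*.txt"):
        counts[path.parent.name] += 1
    return counts

def read_document(path):
    """Read one document as UTF-8 text, the encoding stated for THUCNews."""
    return Path(path).read_text(encoding="utf-8")

if __name__ == "__main__":
    # "THUCNews" is a hypothetical local path to the unpacked archive.
    for category, total in sorted(count_documents("THUCNews").items()):
        print(category, total)
```

If the assumed layout holds, the output should list 14 categories whose counts sum to roughly 740,000.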
