Semi-unsupervised learning with term weighting for text clustering

Source: DataCite Commons. Updated: 2025-09-07. Indexed: 2026-05-04.
Download link:
http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2019.1676
Resource description:
In distance-based (similarity-based) constrained clustering, various approaches have been proposed for defining the distance (similarity) between objects so that similar objects are grouped together and dissimilar objects are separated from each other. This dissertation presents a framework for similarity-based constrained clustering in which statistics extracted from a small number of cue instances with known classes serve as the term weighting scheme that guides the clustering process. Besides two well-known weightings (term frequency and inverse document frequency), the framework uses three additional statistics to emphasize or de-emphasize a term: statistics characterizing the whole collection (in-collection), statistics characterizing a class (intra-class), and statistics contrasting classes (inter-class). The weight of a term is the product of these in-collection, intra-class, and inter-class statistics, where each weighting factor is given a positive (or negative) exponent to promote (or demote) the effect of the term on grouping (clustering). In this research, two alternative term weightings, (1) deviation-based and (2) entropy-based distributions, are compared. To further evaluate the impact of term weighting, the experiments compare the distribution-based term weightings with respect to user intention, varied training set sizes, and varied numbers of clusters. Performance is evaluated on five text datasets: drug information, 20 Newsgroups, Amazon comments, WebKB, and Thai reform text collections. The proposed method is evaluated on three groups of criteria: class-based measures, clustering-based measures, and similarity-based measures.
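The multiplicative weighting described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function name `term_weights`, the concrete deviation-based statistics (population standard deviations of relative term frequency), the `1 + statistic` smoothing, the smoothed idf, and the default exponents are all assumptions chosen only to show how signed exponents promote or demote each factor.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def term_weights(docs, cue_labels, exponents=(0.0, -1.0, 1.0)):
    """Hypothetical sketch: idf times three deviation-based statistics.

    docs       : list of non-empty token lists (the whole collection)
    cue_labels : dict {doc index: class label} for the few cue instances
    exponents  : (in-collection, intra-class, inter-class) exponents;
                 positive promotes a statistic, negative demotes it
    """
    a, b, c = exponents
    n_docs = len(docs)

    # Relative term frequency of every term in every document.
    rel_tf = []
    for tokens in docs:
        counts = Counter(tokens)
        rel_tf.append({t: k / len(tokens) for t, k in counts.items()})

    # Group the labeled cue documents by class.
    by_class = defaultdict(list)
    for idx, label in cue_labels.items():
        by_class[label].append(idx)

    weights = {}
    vocab = {t for tf in rel_tf for t in tf}
    for term in vocab:
        df = sum(1 for tf in rel_tf if term in tf)
        idf = math.log(n_docs / df) + 1.0  # smoothed idf (assumption)

        # In-collection: spread of the term over all documents.
        in_collection = pstdev([tf.get(term, 0.0) for tf in rel_tf])

        # Per-class frequencies, taken from the cue instances only.
        class_freqs = [[rel_tf[i].get(term, 0.0) for i in idxs]
                       for idxs in by_class.values()]
        # Intra-class: average spread inside each class.
        intra_class = mean([pstdev(f) for f in class_freqs])
        # Inter-class: spread of the class means.
        inter_class = pstdev([mean(f) for f in class_freqs])

        # The "1 +" smoothing keeps zero statistics from zeroing or
        # inflating the product (assumption).
        weights[term] = (idf
                         * (1.0 + in_collection) ** a
                         * (1.0 + intra_class) ** b
                         * (1.0 + inter_class) ** c)
    return weights


# Toy collection: four labeled cue documents and one unlabeled document.
docs = [["drug", "dose", "pain"], ["drug", "dose", "dose"],
        ["game", "team", "score"], ["game", "team", "team"],
        ["drug", "game"]]
cues = {0: "med", 1: "med", 2: "sport", 3: "sport"}

idf_only = term_weights(docs, cues, exponents=(0.0, 0.0, 0.0))
weighted = term_weights(docs, cues)
```

In this toy setup, a term such as "drug" that is stable within its class and differs between classes receives a larger weight than its idf alone, because the inter-class factor carries a positive exponent and the intra-class factor a negative one. A per-document vector for clustering would then multiply each term frequency by its weight; an entropy-based variant would swap the standard deviations for entropy statistics.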
Provider:
Thammasat University
Created:
2025-09-07