MVR performance t-test result.
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/MVR_performance_t-test_result_/26821248
Description:
Text clustering is an essential task in data mining and information retrieval. Its aim is to group unlabeled texts into meaningful clusters that make it easier to extract and understand useful information from large volumes of textual data. However, short text clustering (STC) is difficult because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents so that cohesive clusters can be generated. STC typically relies on a single-view representation, which is inadequate because it cannot capture different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), obtained by finding the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by a fixed-length concatenation via the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on the experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of the single-view representations.
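The fixed-length concatenation mentioned in the description can be sketched roughly as follows. This is only one possible reading of the abstract, not the authors' implementation: it assumes each single-view representation is first reduced to the same length `k` with PCA and the reduced views are then concatenated column-wise. The function names `pca_reduce` and `build_mvr`, the value of `k`, and the random matrices standing in for real embeddings (BERT, TF-IDF, and so on) are all hypothetical.

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # SVD of the centered matrix; the rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def build_mvr(views: list[np.ndarray], k: int) -> np.ndarray:
    """Reduce every view to k dimensions, then concatenate column-wise.

    Assumption: every view has one row per document, in the same order.
    """
    n_docs = views[0].shape[0]
    for v in views:
        if v.shape[0] != n_docs:
            raise ValueError("all views must describe the same documents")
        if k > min(v.shape):
            raise ValueError("k exceeds the number of components in a view")
    return np.hstack([pca_reduce(v, k) for v in views])


rng = np.random.default_rng(0)
# Placeholder random matrices standing in for three real single-view embeddings
views = [rng.normal(size=(20, d)) for d in (768, 300, 50)]
mvr = build_mvr(views, k=8)
print(mvr.shape)  # (20, 24)
```

Under this reading, the length of the combined vector depends only on `k` and the number of views, not on the original dimensionality of each view. The resulting matrix could then be passed to any clustering algorithm.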
Created:
2024-08-23



