A Holistic Framework for SBERT-Based Text Clustering via Single-Epoch Contrastive Refinement and Dimensionality Reduction

Indexed by NIAID Data Ecosystem on 2026-05-10

Download link:

https://data.mendeley.com/datasets/55g37wc9r3

Description:

This repository contains randomly sampled subsets of several widely used public text datasets (AG News, BBC News, HuffPost News, Yahoo Answers, and SearchSnippets), along with the experimental code used in this study. The subsets were created to enable controlled and reproducible clustering experiments while preserving the original category distributions.
The primary goal of this work is to study text clustering behavior in high-dimensional embedding spaces and to evaluate whether a single-pass, geometry-aware clustering framework can produce stable and well-separated clusters. The proposed approach, named Single-E-Clust, focuses on improving the geometric structure of text representations rather than repeatedly optimizing clustering objectives.
All CSV files correspond to preprocessed versions of the sampled datasets. Each file contains raw text samples and their original labels, which are retained exclusively for evaluation and are not used during clustering. The preprocessing pipeline is implemented in Preprocess.py and applies text normalization and cleaning steps that remove redundant elements which do not help clustering performance. The core method and its ablation study are implemented in Single-E-Clust.py.
The proposed framework follows a sequential pipeline. Documents are first encoded with a base SBERT model. The intrinsic dimensionality of the embeddings is then estimated, and UMAP reduces them to geometrically meaningful low-dimensional representations. K-Means clustering is applied to these representations to generate pseudo-labels, which are used to fine-tune SBERT with a supervised contrastive loss. This fine-tuning is deliberately restricted to a single epoch to avoid excessive embedding compactness caused by pseudo-label supervision. The resulting fine-tuned SBERT model is used to generate the final embeddings for clustering.
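The contrastive refinement step can be illustrated with a minimal, dependency-free sketch of a supervised contrastive loss computed over pseudo-labels. It assumes the commonly used formulation with L2-normalized embeddings and a temperature of 0.1; the description states neither the exact loss variant nor the temperature, and the function names below are hypothetical rather than taken from Single-E-Clust.py.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, pseudo_labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have a positive.

    Assumption: positives are the other samples that share the anchor's
    pseudo-label; every other sample in the batch serves as a contrast.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and pseudo_labels[j] == pseudo_labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarities to every other sample.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values()))
        total -= sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    # Toy 2-D "embeddings" forming two tight groups (illustrative values only).
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1]))
```

In the pipeline described above, the pseudo-labels would come from K-Means on the UMAP-reduced embeddings rather than being written by hand. The loss is low when pseudo-labels agree with the embedding geometry and high when they cut across it, which gives some intuition for why repeated epochs on noisy pseudo-labels could over-tighten the clusters.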
The impact of different epoch counts on clustering performance is analyzed using Epoch_Compare.py, while Tsne_Panel.py provides t-SNE–based visualizations for qualitative comparison of embedding geometry and cluster separability.
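Because the original labels are held out for evaluation only, comparing runs (for example across epoch counts) amounts to scoring cluster assignments against those labels. The description does not say which metrics the scripts report, so the sketch below assumes normalized mutual information (NMI) with arithmetic-mean normalization, a common choice for clustering evaluation; the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_info(true_labels, cluster_ids):
    """NMI between held-out labels and cluster assignments.

    Uses natural logarithms and normalizes by the arithmetic mean of the
    two entropies. The score is invariant to how cluster ids are numbered.
    """
    n = len(true_labels)
    if n == 0 or n != len(cluster_ids):
        raise ValueError("inputs must be non-empty and of equal length")

    count_true = Counter(true_labels)
    count_pred = Counter(cluster_ids)
    count_joint = Counter(zip(true_labels, cluster_ids))

    # Mutual information from the joint and marginal counts.
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[k]))
        for (t, k), c in count_joint.items())

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mean_entropy = (entropy(count_true) + entropy(count_pred)) / 2
    # Both partitions trivial (one class, one cluster): treat as agreement.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical labels and cluster ids for four documents.
    labels = ["sport", "sport", "tech", "tech"]
    print(normalized_mutual_info(labels, [1, 1, 0, 0]))
```

Scoring the final clustering from each fine-tuned model with the same function would give one number per epoch count, which is one plausible way to tabulate such a comparison.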

Created:

2026-03-13
