Locally Embedding Autoencoders: A Semi-Supervised Manifold Learning Approach of Document Representation

Figshare. Updated 2016-02-09. Indexed 2026-04-29.

Download link:

https://figshare.com/articles/dataset/_Locally_Embedding_Autoencoders_A_Semi_Supervised_Manifold_Learning_Approach_of_Document_Representation_/1639069

Description:

Topic models and neural networks can discover meaningful low-dimensional latent representations of text corpora; as such, they have become a key technology of document representation. However, such models presume all documents are non-discriminatory, resulting in latent representation dependent upon all other documents and an inability to provide discriminative document representation. To address this problem, we propose a semi-supervised manifold-inspired autoencoder to extract meaningful latent representations of documents, taking the local perspective that the latent representation of nearby documents should be correlative. We first determine the discriminative neighbors set with Euclidean distance in observation spaces. Then, the autoencoder is trained by joint minimization of the Bernoulli cross-entropy error between input and output and the sum of the square error between neighbors of input and output. The results of two widely used corpora show that our method yields at least a 15% improvement in document clustering and a nearly 7% improvement in classification tasks compared to comparative methods. The evidence demonstrates that our method can readily capture more discriminative latent representation of new documents. Moreover, some meaningful combinations of words can be efficiently discovered by activating features that promote the comprehensibility of latent representation.
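The two steps named in the abstract (neighbor selection by Euclidean distance, then a joint objective) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the function names `neighbor_sets` and `joint_loss`, the weight `lam`, the convention that a label of -1 means "unlabeled", the rule that two labeled documents with different labels are never neighbors, and the reading of the second term as the squared error between each document's reconstruction and the inputs of its neighbors.

```python
import numpy as np


def neighbor_sets(X, k, labels=None):
    """Return, for each row of X, the indices of up to k nearest rows (Euclidean).

    Assumption: labels use -1 for unlabeled documents, and two labeled
    documents with different labels are never treated as neighbors.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                       # a row is not its own neighbor
    if labels is not None:
        labels = np.asarray(labels)
        both_labeled = (labels[:, None] >= 0) & (labels[None, :] >= 0)
        d2[both_labeled & (labels[:, None] != labels[None, :])] = np.inf
    result = []
    for row in d2:
        idx = np.argsort(row)[:k]
        result.append(idx[np.isfinite(row[idx])])      # drop excluded candidates
    return result


def joint_loss(X, X_hat, neighbors, lam=1.0, eps=1e-12):
    """Bernoulli cross-entropy plus a weighted neighbor squared-error term.

    Assumption: the local term compares the reconstruction of document i
    with the inputs of its neighbors; lam is a hypothetical trade-off weight.
    """
    X_hat = np.clip(X_hat, eps, 1.0 - eps)
    cross_entropy = -np.sum(X * np.log(X_hat) + (1.0 - X) * np.log(1.0 - X_hat))
    local = sum(
        np.sum((X[nbrs] - X_hat[i]) ** 2) for i, nbrs in enumerate(neighbors)
    )
    return cross_entropy + lam * local
```

In a training loop, `X_hat` would be the decoder output of the autoencoder and `joint_loss` the quantity to minimize; the sketch leaves the network and optimizer out because the abstract does not specify them.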

Created:

2016-02-09
