Comparison of document categorization process using key words and citations in a restricted knowledge domain

Source: NIAID Data Ecosystem (indexed 2026-03-10)

Download link:

https://figshare.com/articles/dataset/Comparison_of_document_categorization_process_using_key_words_and_citations_in_a_restricted_knowledge_domain/7511765


Description:

Abstract: Categorization requires extracting representative elements from a document so that its essence can be used to identify similarities among documents and generate categories. This study analyzes the difficulties and results of two document categorization processes in a restricted knowledge domain: the first represents documents by their keywords, the second by their citations. Two experiments illustrate the use of these different attributes. The first used a keyword-based categorization algorithm. The second generated categories from the articles' citations using Artificial Neural Networks. In the restricted knowledge domain used in this study, it was difficult to form groups with keywords as the categorization attribute, because the keywords chosen by the authors were highly similar. As the second experiment shows, citations can be an alternative and more efficient attribute for categorizing these documents. The formation of a set of articles with significant bibliographic coupling and a strong semantic relationship validated the proposed method. The article details the methodology used in the experiments and shows the importance of a careful pre-processing phase for the reliability of the databases. This study may contribute to research on document representation in categorization processes and information retrieval.
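The contrast the abstract draws between keywords and citations can be illustrated with a small sketch. This is not the authors' method (their second experiment used Artificial Neural Networks, and the abstract does not name the keyword algorithm); it only computes two simple pairwise measures, keyword Jaccard similarity and bibliographic coupling strength (the count of shared references), on hypothetical toy data. The article IDs, keywords, reference IDs, and function names are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coupling_strength(refs_a, refs_b):
    """Bibliographic coupling strength: number of references both articles cite."""
    return len(set(refs_a) & set(refs_b))


# Hypothetical toy corpus: in a restricted domain every article may carry
# the same keywords, while the reference lists still differ.
articles = {
    "A": {"keywords": ["information retrieval", "categorization"],
          "citations": ["r1", "r2", "r3"]},
    "B": {"keywords": ["information retrieval", "categorization"],
          "citations": ["r2", "r3", "r4"]},
    "C": {"keywords": ["information retrieval", "categorization"],
          "citations": ["r7", "r8"]},
}


def pairwise(articles):
    """Return (id_x, id_y, keyword_jaccard, coupling_strength) for every pair."""
    rows = []
    for x, y in combinations(sorted(articles), 2):
        rows.append((
            x,
            y,
            jaccard(articles[x]["keywords"], articles[y]["keywords"]),
            coupling_strength(articles[x]["citations"], articles[y]["citations"]),
        ))
    return rows


for row in pairwise(articles):
    print(row)
```

In this toy data the keyword similarity is identical for every pair, so it cannot separate the articles, whereas the shared-reference count distinguishes the pair A and B from the others. Real data would need the careful pre-processing the abstract mentions, for example normalizing reference strings before comparing them.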

Created:

2016-04-01
