Classifying document types to enhance search and recommendations in digital libraries

Name: Classifying document types to enhance search and recommendations in digital libraries
Creator: figshare
Published: 2020-09-02 13:01:47
License: No description available

Source: DataCite Commons. Updated: 2020-09-02. Indexed: 2024-07-25.

Download link:

https://figshare.com/articles/dataset/Classifying_document_types_to_enhance_search_and_recommendations_in_digital_libraries/4834229/1

Resource description:

Taken from "Classifying document types to enhance search and recommendations in digital libraries" https://www.overleaf.com/read/zzzrvmzmwdck

Abstract: In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories that would enable us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios, ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve a 0.96 F1-score using the random forest and AdaBoost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.

The descriptors, as featured in the study, are encoded in the dataset as follows:

authors_len: Number of authors associated with the document entry.
num_of_pages: Number of pages the document has in total.
avg_word_per_page: Average words per page in the document.
total_words: Total words in the document.
source: The online service from which the document originated (can be either "CORE" or "SlideShare").
id: Identifier with which the source's API can be queried to retrieve the corresponding document.
label: The document's type, from "research", "thesis" or "slides".
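The descriptor list above can be read as a record schema. The following is a minimal sketch of loading and validating such records in Python using only the standard library. It assumes the dataset is distributed as CSV with a header row matching the descriptor names, which the description does not confirm. The names `DocumentRecord`, `parse_records` and `count_by_label` are hypothetical helpers introduced for illustration, and the rows in `SAMPLE_CSV` are made-up placeholder values, not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass

# Allowed values as stated in the dataset description.
VALID_SOURCES = {"CORE", "SlideShare"}
VALID_LABELS = {"research", "thesis", "slides"}


@dataclass
class DocumentRecord:
    """One dataset entry; field names follow the listed descriptors."""
    authors_len: int
    num_of_pages: int
    avg_word_per_page: float
    total_words: int
    source: str
    id: str
    label: str


def parse_records(csv_text):
    """Parse CSV text (assumed format) into validated DocumentRecord objects."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["source"] not in VALID_SOURCES:
            raise ValueError(f"unexpected source: {row['source']!r}")
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"unexpected label: {row['label']!r}")
        records.append(DocumentRecord(
            authors_len=int(row["authors_len"]),
            num_of_pages=int(row["num_of_pages"]),
            avg_word_per_page=float(row["avg_word_per_page"]),
            total_words=int(row["total_words"]),
            source=row["source"],
            id=row["id"],
            label=row["label"],
        ))
    return records


def count_by_label(records):
    """Count how many records carry each document type label."""
    counts = {}
    for record in records:
        counts[record.label] = counts.get(record.label, 0) + 1
    return counts


# Made-up placeholder rows for illustration only (not real dataset values).
SAMPLE_CSV = """authors_len,num_of_pages,avg_word_per_page,total_words,source,id,label
3,12,450.0,5400,CORE,example-1,research
1,180,300.0,54000,CORE,example-2,thesis
1,20,35.0,700,SlideShare,example-3,slides
"""

records = parse_records(SAMPLE_CSV)
print(count_by_label(records))
```

Rejecting unknown `source` and `label` values at load time keeps malformed rows from silently entering any later classification step; if the actual file uses a different delimiter or column order, only `parse_records` would need to change.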

Provider:

figshare

Created:

2017-04-19
