An adaptable indexing pipeline for enriching meta information of datasets from heterogeneous repositories

Indexed from Mendeley Data on 2026-04-18

Download link:

https://data.mendeley.com/datasets/3yb7mhxtyf

Description:

Dataset repositories continuously publish large numbers of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, because datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Consequently, indexing that relies on this contextual information alone may not be effective enough to improve their Findability, Accessibility, Interoperability, and Reusability (FAIRness). This paper presents an indexing pipeline that extends the contextual information of datasets according to their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem in which dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach on a manually created gold standard and in a user study.
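
The normalize, reconcile, and enrich steps named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `DOMAIN_KEYWORDS` vocabulary, and every function name below are hypothetical, and plain keyword matching stands in for the topic modeling and expert rules that the description mentions.

```python
from dataclasses import dataclass, field

# Hypothetical domain vocabularies; in the described pipeline these would
# come from domain experts (e.g. essential variables), not be hard-coded.
DOMAIN_KEYWORDS = {
    "biodiversity": {"species abundance", "species distribution", "habitat"},
    "oceanography": {"sea surface temperature", "salinity", "ocean currents"},
}


@dataclass
class DatasetRecord:
    identifier: str
    title: str
    description: str
    domains: list = field(default_factory=list)
    keywords: list = field(default_factory=list)


def normalize(raw: dict) -> DatasetRecord:
    """Map repository-specific field names onto one common schema."""
    return DatasetRecord(
        identifier=str(raw.get("doi") or raw.get("id") or "").strip().lower(),
        title=(raw.get("title") or raw.get("name") or "").strip(),
        description=(raw.get("description") or raw.get("abstract") or "").strip(),
    )


def reconcile(records) -> list:
    """Keep one record per identifier; the first occurrence wins."""
    merged = {}
    for position, record in enumerate(records):
        # Records without an identifier are never merged with each other.
        key = record.identifier or f"no-id:{position}"
        merged.setdefault(key, record)
    return list(merged.values())


def enrich(record: DatasetRecord) -> DatasetRecord:
    """Attach domains and keywords whose terms occur in the record text."""
    text = f"{record.title} {record.description}".lower()
    for domain, vocabulary in sorted(DOMAIN_KEYWORDS.items()):
        hits = sorted(term for term in vocabulary if term in text)
        if hits:
            record.domains.append(domain)
            record.keywords.extend(hits)
    return record


def build_index(raw_records) -> list:
    """Normalize, reconcile, then enrich raw metadata records."""
    return [enrich(r) for r in reconcile(normalize(raw) for raw in raw_records)]
```

Calling `build_index` on a list of raw metadata dictionaries returns one enriched `DatasetRecord` per distinct identifier. Substring matching is a deliberately crude placeholder; a topic model would assign domains even when none of the listed terms appears verbatim.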

Created:

2022-03-01