Dates related to the research results presented in the article: Conceptual Framework for Clustering, Labeling, and Evaluating Scientific Articles with Embedding Models and Bibliometric Analysis

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/v63vrhgwxy

Resource description:

Dataset Description for the Article: "Conceptual Framework for Clustering, Labeling, and Evaluating Scientific Articles with Embedding Models and Bibliometric Analysis"

The dataset underpins the experiments described in the article, which proposes a framework for clustering, labeling, and evaluating scientific publications using both statistical and bibliometric indicators. It consists of three CSV files, each corresponding to a different collection of scholarly publications (small, medium, and large). All publications were retrieved from the Scopus database based on specific search queries.

CSV Files (3 in total)

Each CSV file contains:
- Ranked lists of scholarly articles retrieved from Scopus, supplemented with computed vectors.
- Embedding vectors representing article content (title + abstract).
- Distance metrics (e.g., cosine distance) to enable quick similarity comparisons.

Together, these CSV files mirror the experimental setup discussed in the article, allowing reproduction of the clustering and labeling processes, as well as subsequent evaluations via bibliometric and statistical approaches.

Data Structure

Standard Metadata Columns (from Scopus export): These include typical bibliographic information such as Title, Authors, Year, DOI, Source title, Abstract, Keywords, Affiliations, Document Type, Cited by, and others.

Computed Columns (for experimentation):
- combined_embeddings or article_embedding: A list of numeric values representing the semantic embedding vector generated from the concatenated title and abstract of the publication.
- distance_cosine: The cosine distance between the publication’s embedding and a reference embedding (e.g., based on a user query). Values range from 0 to 1, where lower values indicate higher semantic similarity.

Purpose and Use

These data support the evaluation of embedding-based clustering, labeling, and bibliometric methods for automating systematic literature reviews. They serve as reproducible material for the experiments described in the paper.
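
The computed columns described above can be illustrated with a minimal Python sketch. It assumes, without confirmation from the dataset page, that each embedding cell is stored as a JSON-style list string such as "[0.1, -0.2, 0.3]"; the helper names (parse_embedding, cosine_distance, rank_by_distance) and the sample rows are hypothetical and exist only for illustration. Cosine distance is taken here as 1 minus cosine similarity, which is a common definition but may not be exactly how distance_cosine was produced.

```python
import json
import math


def parse_embedding(cell):
    """Parse a stringified list (assumed format) into a list of floats."""
    return [float(x) for x in json.loads(cell)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(rows, query_embedding, column="combined_embeddings"):
    """Return (distance, row) pairs sorted from most to least similar.

    `rows` is any iterable of dicts, e.g. what csv.DictReader yields.
    """
    scored = [
        (cosine_distance(parse_embedding(row[column]), query_embedding), row)
        for row in rows
    ]
    return sorted(scored, key=lambda pair: pair[0])


# Hypothetical rows standing in for lines of one of the CSV files.
sample_rows = [
    {"Title": "Paper A", "combined_embeddings": "[0.0, 1.0]"},
    {"Title": "Paper B", "combined_embeddings": "[1.0, 0.0]"},
    {"Title": "Paper C", "combined_embeddings": "[1.0, 1.0]"},
]

# Hypothetical reference (query) embedding.
ranked = rank_by_distance(sample_rows, [1.0, 0.0])
ranked_titles = [row["Title"] for _, row in ranked]
```

With real data, the rows would come from csv.DictReader over one of the three files, and the column name would be whichever of combined_embeddings or article_embedding that file uses. Note that 1 minus cosine similarity can in general reach 2 for opposed vectors; the 0 to 1 range stated in the description holds when similarities are non-negative.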

Created:

2025-06-05