Dates related to the research results presented in the article: Conceptual Framework for Clustering, Labeling, and Evaluating Scientific Articles with Embedding Models and Bibliometric Analysis

Indexed by NIAID Data Ecosystem on 2026-05-02

Download link:

https://data.mendeley.com/datasets/v63vrhgwxy

Resource description:

## Dataset Description for the Article: "Conceptual Framework for Clustering, Labeling, and Evaluating Scientific Articles with Embedding Models and Bibliometric Analysis"

The dataset underpins the experiments described in the article, which proposes a framework for clustering, labeling, and evaluating scientific publications using both statistical and bibliometric indicators. It consists of three CSV files, each corresponding to a different collection of scholarly publications (small, medium, and large). All publications were retrieved from the Scopus database based on specific search queries.

### CSV Files (3 in total)

Each CSV file contains:

1. Ranked lists of scholarly articles retrieved from Scopus, supplemented with computed vectors.
2. Embedding vectors representing article content (title + abstract).
3. Distance metrics (e.g., cosine distance) to enable quick similarity comparisons.

Together, these CSV files mirror the experimental setup discussed in the article, allowing reproduction of the clustering and labeling processes, as well as subsequent evaluations via bibliometric and statistical approaches.

### Data Structure

#### Standard Metadata Columns (from Scopus export)

These include typical bibliographic information such as Title, Authors, Year, DOI, Source title, Abstract, Keywords, Affiliations, Document Type, Cited by, and others.

#### Computed Columns (for experimentation)

1. `combined_embeddings` or `article_embedding`: A list of numeric values representing the semantic embedding vector generated from the concatenated title and abstract of the publication.
2. `distance_cosine`: The cosine distance between the publication's embedding and a reference embedding (e.g., based on a user query). Values range from 0 to 1, where lower values indicate higher semantic similarity.

### Purpose and Use

These data support the evaluation of embedding-based clustering, labeling, and bibliometric methods for automating systematic literature reviews. They serve as reproducible material for the experiments described in the paper.
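
As a rough illustration of how the computed columns relate to each other, the sketch below parses an embedding cell and recomputes a cosine distance against a reference vector, then ranks the rows by that distance. It rests on several assumptions not confirmed by the description: that the embedding column is stored as a stringified list such as `"[0.1, 0.2]"`, that cosine distance is defined as 1 minus cosine similarity, and that the helper names (`parse_embedding`, `cosine_distance`, `rank_by_distance`) and the in-memory sample rows are purely hypothetical stand-ins for the real CSV files.

```python
import ast
import csv
import io
import math


def parse_embedding(cell):
    """Parse a stringified list such as "[0.1, 0.2]" into a list of floats.

    Assumption: the CSV stores each embedding as a Python/JSON-style list.
    """
    return [float(x) for x in ast.literal_eval(cell)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(csv_file, reference, column="combined_embeddings"):
    """Compute each row's distance to `reference` and sort ascending."""
    rows = []
    for row in csv.DictReader(csv_file):
        embedding = parse_embedding(row[column])
        row["distance_cosine"] = cosine_distance(embedding, reference)
        rows.append(row)
    return sorted(rows, key=lambda r: r["distance_cosine"])


# Tiny in-memory stand-in for one CSV file; the values are made up.
SAMPLE = io.StringIO(
    "Title,combined_embeddings\n"
    'Paper A,"[1.0, 0.0]"\n'
    'Paper B,"[0.0, 1.0]"\n'
    'Paper C,"[1.0, 1.0]"\n'
)

ranked = rank_by_distance(SAMPLE, reference=[1.0, 0.0])
for r in ranked:
    print(r["Title"], round(r["distance_cosine"], 3))
```

To try this against one of the actual files, the `io.StringIO` sample could be replaced with `open(path, newline="", encoding="utf-8")` and the `column` argument set to whichever embedding column the file uses. Under the 1 minus similarity definition the distance can in principle exceed 1 for vectors pointing in opposite directions, so values outside the 0 to 1 range reported in the description would suggest a different definition or reference vector.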


Created:

2025-06-05