Classification of research publications based on data from OpenAlex
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10560275
Description:
This data set contains an algorithmic classification of research publications based on data from OpenAlex. The classification is based on the OpenAlex snapshot released on November 21, 2023.
To build the classification, we used the so-called extended direct citation approach in combination with the Leiden algorithm. The source code of our software is available here. The classification covers the 71 million journal articles, proceedings papers, preprints, and book chapters in OpenAlex that were published between 2000 and 2023 and that are connected to each other by citation links. Based on 1715 million citation links, we built a three-level hierarchical classification. Each publication was assigned to a cluster at each of the three levels of the classification. Clusters consist of publications that are relatively strongly connected by citation links and that can therefore be expected to be topically related. At each level of the classification, a publication was assigned to only one cluster, which means clusters do not overlap.
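Because each publication belongs to exactly one cluster per level and the classification is hierarchical, every micro cluster should sit inside a single meso cluster, and every meso cluster inside a single macro cluster. The following is a minimal sketch of how a user might check that property on rows read from clustering.tsv. The function name and the sample rows are hypothetical and only illustrate the idea; the column names are taken from the file listing in this record.

```python
from collections import defaultdict


def find_nesting_violations(rows, child_key, parent_key):
    """Return child clusters that appear under more than one parent cluster.

    An empty result means the child level nests cleanly inside the parent level.
    """
    parents = defaultdict(set)
    for row in rows:
        parents[row[child_key]].add(row[parent_key])
    return {child: ps for child, ps in parents.items() if len(ps) > 1}


# Made-up rows for illustration only; real IDs come from clustering.tsv.
sample_rows = [
    {"work_id": "W1", "macro_cluster_id": "1", "meso_cluster_id": "10", "micro_cluster_id": "100"},
    {"work_id": "W2", "macro_cluster_id": "1", "meso_cluster_id": "10", "micro_cluster_id": "100"},
    {"work_id": "W3", "macro_cluster_id": "1", "meso_cluster_id": "11", "micro_cluster_id": "101"},
]

micro_violations = find_nesting_violations(sample_rows, "micro_cluster_id", "meso_cluster_id")
meso_violations = find_nesting_violations(sample_rows, "meso_cluster_id", "macro_cluster_id")
print(micro_violations, meso_violations)
```

On the real data the same function could be applied to rows streamed from clustering.tsv; an empty dictionary at both levels would be consistent with the non-overlapping hierarchy described above.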
The classification consists of 4521 micro clusters at the lowest (most granular) level, 917 meso clusters at the middle level, and 20 macro clusters at the highest (least granular) level. We also algorithmically linked each cluster in the classification to one or more of the following five broad main fields: biomedical and health sciences, life and earth sciences, mathematics and computer science, physical sciences and engineering, and social sciences and humanities.
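Since a cluster can be linked to more than one main field, it may help to list the fields of each cluster in order of weight. The sketch below assumes that the weight column in the *_main_field.tsv files parses as a number and that a larger weight means a stronger link; neither is stated in this record. The function name, IDs, and weights are made up for illustration.

```python
from collections import defaultdict


def main_fields_by_cluster(link_rows, field_rows, cluster_key="micro_cluster_id"):
    """Map each cluster ID to its main field names, ordered by descending weight."""
    names = {row["main_field_id"]: row["main_field"] for row in field_rows}
    grouped = defaultdict(list)
    for row in link_rows:
        # Assumption: weight is numeric and larger means a stronger link.
        grouped[row[cluster_key]].append((float(row["weight"]), names[row["main_field_id"]]))
    return {
        cluster: [name for _, name in sorted(pairs, key=lambda p: p[0], reverse=True)]
        for cluster, pairs in grouped.items()
    }


# Made-up rows shaped like main_field.tsv and micro_cluster_main_field.tsv.
field_rows = [
    {"main_field_id": "1", "main_field": "Biomedical and health sciences"},
    {"main_field_id": "2", "main_field": "Life and earth sciences"},
]
link_rows = [
    {"micro_cluster_id": "100", "main_field_id": "2", "weight": "0.3"},
    {"micro_cluster_id": "100", "main_field_id": "1", "weight": "0.7"},
    {"micro_cluster_id": "101", "main_field_id": "2", "weight": "1.0"},
]

fields = main_fields_by_cluster(link_rows, field_rows)
print(fields)
```

Passing cluster_key="meso_cluster_id" or cluster_key="macro_cluster_id" would apply the same grouping to the meso and macro link files.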
We used the Updated GPT 3.5 Turbo large language model, developed by OpenAI, to label the 4521 micro clusters at the lowest level in the classification. The source code of our software can be found here.
See this blog post for more information about the classification.
The classification, including the labels of the micro clusters, is available in the following tab-delimited files.
clustering.tsv: work_id, doi, macro_cluster_id, meso_cluster_id, micro_cluster_id
main_field.tsv: main_field_id, main_field
macro_cluster.tsv: macro_cluster_id, macro_cluster, n_works
macro_cluster_main_field.tsv: macro_cluster_id, main_field_seq, main_field_id, weight, is_primary_main_field
meso_cluster.tsv: meso_cluster_id, meso_cluster, parent_macro_cluster_id, n_works
meso_cluster_main_field.tsv: meso_cluster_id, main_field_seq, main_field_id, weight, is_primary_main_field
meso_cluster_source.tsv: meso_cluster_id, source_seq, source_id, n_works
micro_cluster.tsv: micro_cluster_id, micro_cluster, short_label, long_label, keywords, summary, wikipedia_url, parent_macro_cluster_id, parent_meso_cluster_id, n_works
micro_cluster_main_field.tsv: micro_cluster_id, main_field_seq, main_field_id, weight, is_primary_main_field
micro_cluster_keyword.tsv: micro_cluster_id, keyword_seq, keyword
micro_cluster_source.tsv: micro_cluster_id, source_seq, source_id, n_works
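As a usage illustration, the files can be read with Python's standard csv module and joined on their shared ID columns. The sketch below attaches the short_label of a micro cluster to each work. It assumes each file starts with a header row containing the column names listed in this record; the helper names and the sample rows (including the DOIs and labels) are invented so the example runs without downloading the data.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def label_works(clustering_rows, micro_cluster_rows):
    """Map each work_id to the short_label of its micro cluster (None if missing)."""
    labels = {row["micro_cluster_id"]: row["short_label"] for row in micro_cluster_rows}
    return {row["work_id"]: labels.get(row["micro_cluster_id"]) for row in clustering_rows}


# Made-up sample content standing in for clustering.tsv and micro_cluster.tsv.
SAMPLE_CLUSTERING = (
    "work_id\tdoi\tmacro_cluster_id\tmeso_cluster_id\tmicro_cluster_id\n"
    "W1\t10.0000/example.1\t3\t41\t1207\n"
    "W2\t10.0000/example.2\t3\t41\t1208\n"
)
SAMPLE_MICRO = (
    "micro_cluster_id\tshort_label\n"
    "1207\tHypothetical label A\n"
    "1208\tHypothetical label B\n"
)

clustering = read_tsv(io.StringIO(SAMPLE_CLUSTERING))
micro = read_tsv(io.StringIO(SAMPLE_MICRO))
work_labels = label_works(clustering, micro)
print(work_labels)
```

With the downloaded files, the io.StringIO objects would be replaced by handles such as open("clustering.tsv", newline="", encoding="utf-8"). Since clustering.tsv covers about 71 million publications, iterating over the reader row by row is likely to be more practical than loading everything into a list.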
Created:
2024-01-24



