open-index/open-alex
Hugging Face · Updated 2026-04-09 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/open-index/open-alex
Resource overview:
---
language:
- en
license: cc0-1.0
task_categories:
- feature-extraction
- text-classification
- question-answering
pretty_name: OpenAlex - Complete Academic Research Database
size_categories:
- 100M<n<1B
source_datasets:
- openalex
tags:
- academic
- research
- scholarly
- citations
- science
- open-access
- parquet
- bibliometrics
- scientometrics
dataset_info:
- config_name: topics
features:
- name: id
dtype: string
- name: display_name
dtype: string
- name: description
dtype: string
- name: keywords
dtype: string
- name: subfield_id
dtype: string
- name: subfield_name
dtype: string
- name: field_id
dtype: string
- name: field_name
dtype: string
- name: domain_id
dtype: string
- name: domain_name
dtype: string
- name: siblings
dtype: string
- name: works_count
dtype: int32
- name: cited_by_count
dtype: int32
- name: ids
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
- config_name: publishers
features:
- name: id
dtype: string
- name: display_name
dtype: string
- name: alternate_titles
dtype: string
- name: hierarchy_level
dtype: int32
- name: parent_publisher
dtype: string
- name: country_codes
dtype: string
- name: homepage_url
dtype: string
- name: works_count
dtype: int32
- name: cited_by_count
dtype: int32
- name: h_index
dtype: int32
- name: i10_index
dtype: int32
- name: lineage
dtype: string
- name: roles
dtype: string
- name: counts_by_year
dtype: string
- name: ids
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
- config_name: funders
features:
- name: id
dtype: string
- name: display_name
dtype: string
- name: alternate_titles
dtype: string
- name: country_code
dtype: string
- name: description
dtype: string
- name: homepage_url
dtype: string
- name: works_count
dtype: int32
- name: cited_by_count
dtype: int32
- name: awards_count
dtype: int32
- name: h_index
dtype: int32
- name: i10_index
dtype: int32
- name: roles
dtype: string
- name: counts_by_year
dtype: string
- name: ids
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
- config_name: sources
features:
- name: id
dtype: string
- name: issn_l
dtype: string
- name: issn
dtype: string
- name: display_name
dtype: string
- name: type
dtype: string
- name: host_organization
dtype: string
- name: host_organization_name
dtype: string
- name: works_count
dtype: int32
- name: cited_by_count
dtype: int32
- name: is_oa
dtype: bool
- name: is_in_doaj
dtype: bool
- name: is_core
dtype: bool
- name: homepage_url
dtype: string
- name: country_code
dtype: string
- name: h_index
dtype: int32
- name: i10_index
dtype: int32
- name: apc_usd
dtype: int32
- name: alternate_titles
dtype: string
- name: topics
dtype: string
- name: counts_by_year
dtype: string
- name: ids
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
- config_name: institutions
features:
- name: id
dtype: string
- name: ror
dtype: string
- name: display_name
dtype: string
- name: type
dtype: string
- name: country_code
dtype: string
- name: homepage_url
dtype: string
- name: image_url
dtype: string
- name: works_count
dtype: int32
- name: cited_by_count
dtype: int32
- name: h_index
dtype: int32
- name: i10_index
dtype: int32
- name: geo_city
dtype: string
- name: geo_region
dtype: string
- name: geo_country
dtype: string
- name: geo_latitude
dtype: float64
- name: geo_longitude
dtype: float64
- name: associated_institutions
dtype: string
- name: lineage
dtype: string
- name: topics
dtype: string
- name: counts_by_year
dtype: string
- name: roles
dtype: string
- name: ids
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
- config_name: authors
features:
- name: id
dtype: string
- name: orcid
dtype: string
- name: display_name
dtype: string
- name: display_name_alternatives
dtype: string
- name: works_count
dtype: int32
- name: cited_by_count
dtype: int32
- name: h_index
dtype: int32
- name: i10_index
dtype: int32
- name: two_yr_mean_citedness
dtype: float64
- name: affiliations
dtype: string
- name: last_known_institutions
dtype: string
- name: topics
dtype: string
- name: topic_share
dtype: string
- name: counts_by_year
dtype: string
- name: ids
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
- config_name: works
features:
- name: id
dtype: string
- name: doi
dtype: string
- name: title
dtype: string
- name: publication_year
dtype: int32
- name: publication_date
dtype: string
- name: type
dtype: string
- name: language
dtype: string
- name: is_retracted
dtype: bool
- name: is_paratext
dtype: bool
- name: cited_by_count
dtype: int32
- name: fwci
dtype: float64
- name: referenced_works_count
dtype: int32
- name: authors_count
dtype: int32
- name: locations_count
dtype: int32
- name: is_oa
dtype: bool
- name: oa_status
dtype: string
- name: oa_url
dtype: string
- name: primary_location
dtype: string
- name: best_oa_location
dtype: string
- name: locations
dtype: string
- name: authorships
dtype: string
- name: biblio_volume
dtype: string
- name: biblio_issue
dtype: string
- name: biblio_first_page
dtype: string
- name: biblio_last_page
dtype: string
- name: primary_topic
dtype: string
- name: topics
dtype: string
- name: keywords
dtype: string
- name: referenced_works
dtype: string
- name: related_works
dtype: string
- name: abstract_inverted_index
dtype: string
- name: ids
dtype: string
- name: counts_by_year
dtype: string
- name: sustainable_development_goals
dtype: string
- name: indexed_in
dtype: string
- name: created_date
dtype: string
- name: updated_date
dtype: string
configs:
- config_name: topics
data_files: "data/topics/*.parquet"
- config_name: publishers
data_files: "data/publishers/*.parquet"
- config_name: funders
data_files: "data/funders/*.parquet"
- config_name: sources
data_files: "data/sources/*.parquet"
- config_name: institutions
data_files: "data/institutions/*.parquet"
- config_name: authors
data_files: "data/authors/*.parquet"
- config_name: works
data_files: "data/works/*.parquet"
---
# OpenAlex: Complete Academic Research Database
The world's scholarly research catalog, converted to analysis-ready Parquet. 114.1M records across 7 entity types.
## Table of Contents
- [What is this?](#what-is-this)
- [Quick start](#quick-start)
- [Entity overview](#entity-overview)
- [How entities connect](#how-entities-connect)
- [Schema reference](#schema-reference)
- [Working with abstracts](#working-with-abstracts)
- [Pipeline details](#pipeline-details)
- [Things to know](#things-to-know)
- [Attribution](#attribution)
## What is this?
[OpenAlex](https://openalex.org) is a free, open catalog of the global research system: papers, authors, institutions, journals, topics, publishers, and funders. It's maintained by [OurResearch](https://ourresearch.org/) as the open replacement for the discontinued Microsoft Academic Graph. The index currently covers over 250 million scholarly works with full citation networks, authorship chains, institutional affiliations, and topic classifications.
This dataset is a straight conversion of the [OpenAlex snapshot](https://docs.openalex.org/download-all-data/openalex-snapshot) (2026-04) from gzipped JSON Lines into sharded, ZSTD-compressed Parquet. **114.1M total records** split into files of up to 1 million rows each. You can query it directly with DuckDB (no download needed), stream it with the `datasets` library, or pull specific entities with `huggingface_hub`.
### File layout
```
data/
  works/
    works-00000.parquet            scholarly works (~1M rows each)
    works-00001.parquet
    ...
  authors/
    authors-00000.parquet          researchers and their metrics
    ...
  sources/
    sources-00000.parquet          journals, repositories, conferences
  institutions/
    institutions-00000.parquet     universities, labs, companies
  topics/
    topics-00000.parquet           research topic taxonomy
  publishers/
    publishers-00000.parquet       academic publishers
  funders/
    funders-00000.parquet          funding organizations
```
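As a rough sketch of how this layout maps to paths, the helpers below build a single shard path and the `hf://` glob for an entity. The function names are hypothetical, and the 5-digit zero-padded numbering is an assumption inferred from the file names shown above.

```python
REPO = "open-index/open-alex"  # dataset repo id, as it appears in this card


def shard_path(entity: str, index: int) -> str:
    """Repo-relative path of one shard (assumes 5-digit zero-padded numbering)."""
    return f"data/{entity}/{entity}-{index:05d}.parquet"


def entity_glob(entity: str) -> str:
    """hf:// glob covering every shard of one entity, usable in a DuckDB FROM clause."""
    return f"hf://datasets/{REPO}/data/{entity}/*.parquet"


print(shard_path("works", 1))
print(entity_glob("authors"))
```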
Nested fields (authorships, locations, topics, etc.) are stored as JSON strings. Use `json_extract()` in DuckDB or `json.loads()` in Python to work with them.
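A minimal sketch of the Python side, assuming a row arrives as a plain dict with the nested columns still encoded as JSON strings. The helper `parse_nested` and the sample row are illustrative only; the IDs and names are invented, and the exact shape of each authorship entry should be checked against real data.

```python
import json


def parse_nested(value):
    """Decode a JSON-string column; treat None or an empty string as missing."""
    if value is None or value == "":
        return None
    return json.loads(value)


# Invented sample row: only the overall shape (a list of authorship objects,
# each with an "author" object) is assumed here.
row = {
    "id": "https://openalex.org/W0000000001",
    "authorships": json.dumps(
        [{"author": {"id": "https://openalex.org/A0000000001", "display_name": "Ada Example"}}]
    ),
    "best_oa_location": None,
}

authorships = parse_nested(row["authorships"])
first_author = authorships[0]["author"]["display_name"]
print(first_author)
print(parse_nested(row["best_oa_location"]))
```

Guarding against `None` matters because many nested columns are only partially filled.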
## Quick start
### DuckDB (no download required)
DuckDB can read Parquet files directly from Hugging Face. This is the fastest way to explore:
```sql
-- Most-cited works of all time
SELECT id, title, publication_year, cited_by_count, doi, oa_status
FROM 'hf://datasets/open-index/open-alex/data/works/*.parquet'
WHERE cited_by_count > 1000
ORDER BY cited_by_count DESC
LIMIT 20;
```
```sql
-- Top authors by h-index
SELECT id, display_name, h_index, i10_index, works_count, cited_by_count
FROM 'hf://datasets/open-index/open-alex/data/authors/*.parquet'
ORDER BY h_index DESC
LIMIT 20;
```
```sql
-- Open access rates by year
SELECT publication_year,
       COUNT(*) AS total,
       SUM(CASE WHEN is_oa THEN 1 ELSE 0 END) AS oa_count,
       ROUND(100.0 * SUM(CASE WHEN is_oa THEN 1 ELSE 0 END) / COUNT(*), 1) AS oa_pct
FROM 'hf://datasets/open-index/open-alex/data/works/*.parquet'
WHERE publication_year BETWEEN 2000 AND 2025
GROUP BY publication_year
ORDER BY publication_year;
```
```sql
-- Top institutions in a country
SELECT display_name, type, geo_city, works_count, cited_by_count, h_index
FROM 'hf://datasets/open-index/open-alex/data/institutions/*.parquet'
WHERE country_code = 'US'
ORDER BY works_count DESC
LIMIT 20;
```
```sql
-- Extract author affiliations from nested JSON
SELECT id, display_name,
       json_extract_string(last_known_institutions, '$[0].display_name') AS institution,
       json_extract_string(last_known_institutions, '$[0].country_code') AS country
FROM 'hf://datasets/open-index/open-alex/data/authors/*.parquet'
WHERE last_known_institutions IS NOT NULL
ORDER BY h_index DESC
LIMIT 20;
```
```sql
-- Join works to their first author
SELECT w.title, w.publication_year, w.cited_by_count, a.display_name, a.h_index
FROM 'hf://datasets/open-index/open-alex/data/works/*.parquet' w
JOIN 'hf://datasets/open-index/open-alex/data/authors/*.parquet' a
  ON a.id = json_extract_string(w.authorships, '$[0].author.id')
WHERE w.cited_by_count > 5000
ORDER BY w.cited_by_count DESC
LIMIT 20;
```
```sql
-- Citation distribution percentiles
SELECT
    percentile_disc(0.50) WITHIN GROUP (ORDER BY cited_by_count) AS p50,
    percentile_disc(0.90) WITHIN GROUP (ORDER BY cited_by_count) AS p90,
    percentile_disc(0.99) WITHIN GROUP (ORDER BY cited_by_count) AS p99,
    AVG(cited_by_count) AS mean
FROM read_parquet('hf://datasets/open-index/open-alex/data/works/*.parquet');
```
### Python (datasets library)
```python
from itertools import islice

from datasets import load_dataset

# Stream works without downloading everything
ds = load_dataset("open-index/open-alex", "works", split="train", streaming=True)
for work in islice(ds, 5):  # preview a few rows instead of iterating the whole stream
    print(work["id"], work["title"], work["cited_by_count"])

# Load smaller entities into memory
# (authors has 113.6M rows, so streaming is the safer choice for it)
topics = load_dataset("open-index/open-alex", "topics", split="train")
publishers = load_dataset("open-index/open-alex", "publishers", split="train")
```
### Downloading specific entities
```python
from huggingface_hub import snapshot_download
# Just the small entities (~500 MB total)
snapshot_download(
    "open-index/open-alex",
    repo_type="dataset",
    local_dir="./openalex/",
    allow_patterns=[
        "data/topics/*",
        "data/publishers/*",
        "data/funders/*",
        "data/sources/*",
        "data/institutions/*",
    ],
)
# For faster downloads:
# pip install huggingface_hub[hf_transfer]
# HF_HUB_ENABLE_HF_TRANSFER=1
```
## Entity overview
| Entity | Records | What's in it |
|---|---|---|
| **Topics** | 4.5K | Research topics with hierarchical classification (domain → field → subfield → topic) |
| **Publishers** | 10.7K | Academic publishers with hierarchy levels and country information |
| **Funders** | 32.4K | Research funding organizations with award counts and cross-references |
| **Sources** | 280.7K | Journals, repositories, conferences, and ebook platforms with ISSN, DOAJ status, and APC pricing |
| **Institutions** | 121.5K | Universities, research centers, companies, and government bodies with ROR IDs and geolocation |
| **Authors** | 113.6M | Researchers with ORCID IDs, h-index, affiliations, and publication statistics |
| **Works** | n/a | Scholarly works (articles, books, datasets) with citations, DOIs, topics, authorships, and open access status |
## How entities connect
Works sit at the center. Everything else links through them.
```
                    +---------------+
                    |     Works     |  (central entity)
                    +-------+-------+
           +----------------+------------------+
           |                |                  |
    +------v------+   +-----v------+ +---------v--------+
    | Authorships |   | Locations  | | Referenced Works |
    |  (nested)   |   |  (nested)  | |   (citations)    |
    +------+------+   +-----+------+ +------------------+
           |                |
    +------v------+     +---v------+
    |   Authors   |     | Sources  |  journals, repos, conferences
    +------+------+     +---+------+
           |                |
+----------v------+     +---v--------+
|  Institutions   |     | Publishers |
+-----------------+     +------------+

Topics:  domain > field > subfield > topic (4-level hierarchy)
Funders: linked to works through grants and awards
```
Authorships, locations, and topics are nested as JSON inside works. To join works with their authors, parse the `authorships` field.
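One way to prepare such a join in Python is to flatten `authorships` into (work, author, position) edges first. This is only a sketch: `author_edges` is a hypothetical helper, the short IDs are invented, and it assumes each authorship entry carries an `author.id` field, as the DuckDB join example in this card does.

```python
import json
from typing import Iterable, Iterator, Tuple


def author_edges(works: Iterable[dict]) -> Iterator[Tuple[str, str, int]]:
    """Yield (work_id, author_id, position) for every authorship entry."""
    for work in works:
        raw = work.get("authorships")
        if not raw:  # column is NULL or empty for this work
            continue
        for position, entry in enumerate(json.loads(raw)):
            author = entry.get("author") or {}
            if author.get("id"):
                yield work["id"], author["id"], position


# Invented rows, just to exercise the helper
works = [
    {"id": "W1", "authorships": json.dumps([{"author": {"id": "A1"}}, {"author": {"id": "A2"}}])},
    {"id": "W2", "authorships": None},
]
edges = list(author_edges(works))
print(edges)
```

The resulting edge list can be joined against the authors table on `author_id`, with position 0 corresponding to the first listed author.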
## Schema reference
### Topics
Research topics with hierarchical classification (domain → field → subfield → topic).
| Column | Type |
|---|---|
| `id` | string |
| `display_name` | string |
| `description` | string |
| `keywords` | string |
| `subfield_id` | string |
| `subfield_name` | string |
| `field_id` | string |
| `field_name` | string |
| `domain_id` | string |
| `domain_name` | string |
| `siblings` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |
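Because each topic row already carries its domain, field, and subfield names, the four-level hierarchy can be rebuilt by grouping on those columns. The sketch below assumes rows are plain dicts; `build_hierarchy` is a hypothetical helper and the sample names are placeholders, not real OpenAlex topics.

```python
def build_hierarchy(rows):
    """Group topic rows into {domain: {field: {subfield: [topic, ...]}}}."""
    tree = {}
    for row in rows:
        fields = tree.setdefault(row["domain_name"], {})
        subfields = fields.setdefault(row["field_name"], {})
        subfields.setdefault(row["subfield_name"], []).append(row["display_name"])
    return tree


# Placeholder rows using the column names from the Topics schema
rows = [
    {"domain_name": "Domain A", "field_name": "Field X", "subfield_name": "Subfield 1", "display_name": "Topic one"},
    {"domain_name": "Domain A", "field_name": "Field X", "subfield_name": "Subfield 1", "display_name": "Topic two"},
    {"domain_name": "Domain A", "field_name": "Field Y", "subfield_name": "Subfield 2", "display_name": "Topic three"},
]
tree = build_hierarchy(rows)
print(tree["Domain A"]["Field X"]["Subfield 1"])
```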
### Publishers
Academic publishers with hierarchy levels and country information.
| Column | Type |
|---|---|
| `id` | string |
| `display_name` | string |
| `alternate_titles` | string |
| `hierarchy_level` | int32 |
| `parent_publisher` | string |
| `country_codes` | string |
| `homepage_url` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `lineage` | string |
| `roles` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |
**Data completeness** (fields below 100%):
| Field | Fill rate | Est. records |
|---|---|---|
| `alternate_titles` | 10.0% | 1.1K |
| `parent_publisher` | 0.0% | 0 |
| `country_codes` | 90.0% | 9.6K |
| `homepage_url` | 80.0% | 8.6K |
| `counts_by_year` | 90.0% | 9.6K |
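The "Est. records" values in these completeness tables appear consistent with multiplying the fill rate by the entity's total record count. The sketch below reproduces that arithmetic for the Publishers table, taking 10.7K as exactly 10,700 records; both the formula and that rounding are assumptions, not something the card states.

```python
def estimated_records(total_records: int, fill_rate_pct: float) -> int:
    """Estimate how many records have a field filled, given a fill rate in percent."""
    return round(total_records * fill_rate_pct / 100)


def format_k(count: int) -> str:
    """Format a count in thousands with one decimal, e.g. 9630 -> '9.6K'."""
    return f"{count / 1000:.1f}K"


PUBLISHERS_TOTAL = 10_700  # 10.7K from the entity overview, assumed exact

for field, pct in [("alternate_titles", 10.0), ("country_codes", 90.0), ("homepage_url", 80.0)]:
    print(field, format_k(estimated_records(PUBLISHERS_TOTAL, pct)))
```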
### Funders
Research funding organizations with award counts and cross-references.
| Column | Type |
|---|---|
| `id` | string |
| `display_name` | string |
| `alternate_titles` | string |
| `country_code` | string |
| `description` | string |
| `homepage_url` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `awards_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `roles` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |
**Data completeness** (fields below 100%):
| Field | Fill rate | Est. records |
|---|---|---|
| `alternate_titles` | 87.5% | 28.4K |
| `description` | 56.2% | 18.2K |
| `homepage_url` | 53.1% | 17.2K |
### Sources
Journals, repositories, conferences, and ebook platforms with ISSN, DOAJ status, and APC pricing.
| Column | Type |
|---|---|
| `id` | string |
| `issn_l` | string |
| `issn` | string |
| `display_name` | string |
| `type` | string |
| `host_organization` | string |
| `host_organization_name` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `is_oa` | bool |
| `is_in_doaj` | bool |
| `is_core` | bool |
| `homepage_url` | string |
| `country_code` | string |
| `h_index` | int32 |
| `i10_index` | int32 |
| `apc_usd` | int32 |
| `alternate_titles` | string |
| `topics` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |
**Data completeness** (fields below 100%):
| Field | Fill rate | Est. records |
|---|---|---|
| `issn_l` | 61.4% | 172.4K |
| `issn` | 61.4% | 172.4K |
| `host_organization` | 25.7% | 72.2K |
| `host_organization_name` | 25.4% | 71.2K |
| `homepage_url` | 26.4% | 74.2K |
| `country_code` | 42.5% | 119.3K |
| `apc_usd` | 3.2% | 9.0K |
| `alternate_titles` | 23.2% | 65.2K |
| `topics` | 92.9% | 260.6K |
| `counts_by_year` | 93.2% | 261.6K |
| `created_date` | 93.2% | 261.6K |
### Institutions
Universities, research centers, companies, and government bodies with ROR IDs and geolocation.
| Column | Type |
|---|---|
| `id` | string |
| `ror` | string |
| `display_name` | string |
| `type` | string |
| `country_code` | string |
| `homepage_url` | string |
| `image_url` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `geo_city` | string |
| `geo_region` | string |
| `geo_country` | string |
| `geo_latitude` | float64 |
| `geo_longitude` | float64 |
| `associated_institutions` | string |
| `lineage` | string |
| `topics` | string |
| `counts_by_year` | string |
| `roles` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |
**Data completeness** (fields below 100%):
| Field | Fill rate | Est. records |
|---|---|---|
| `country_code` | 94.2% | 114.5K |
| `homepage_url` | 99.2% | 120.5K |
| `image_url` | 10.7% | 13.1K |
| `geo_region` | 38.0% | 46.2K |
| `associated_institutions` | 32.2% | 39.2K |
| `topics` | 88.4% | 107.5K |
| `counts_by_year` | 86.8% | 105.4K |
### Authors
Researchers with ORCID IDs, h-index, affiliations, and publication statistics.
| Column | Type |
|---|---|
| `id` | string |
| `orcid` | string |
| `display_name` | string |
| `display_name_alternatives` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `two_yr_mean_citedness` | float64 |
| `affiliations` | string |
| `last_known_institutions` | string |
| `topics` | string |
| `topic_share` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |
**Data completeness** (fields below 100%):
| Field | Fill rate | Est. records |
|---|---|---|
| `orcid` | 7.1% | 8.0M |
| `affiliations` | 44.5% | 50.6M |
| `last_known_institutions` | 39.4% | 44.7M |
| `topics` | 97.2% | 110.5M |
| `topic_share` | 97.2% | 110.5M |
| `counts_by_year` | 99.6% | 113.2M |
### Works
Scholarly works (articles, books, datasets) with citations, DOIs, topics, authorships, and open access status.
| Column | Type |
|---|---|
| `id` | string |
| `doi` | string |
| `title` | string |
| `publication_year` | int32 |
| `publication_date` | string |
| `type` | string |
| `language` | string |
| `is_retracted` | bool |
| `is_paratext` | bool |
| `cited_by_count` | int32 |
| `fwci` | float64 |
| `referenced_works_count` | int32 |
| `authors_count` | int32 |
| `locations_count` | int32 |
| `is_oa` | bool |
| `oa_status` | string |
| `oa_url` | string |
| `pr
Provider:
open-index
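Collected summary
Dataset introduction
Construction
In academic bibliometrics, the OpenAlex dataset serves as an open catalog of the global research system, and its construction reflects a systematic integration of large-scale, heterogeneous scholarly data. The dataset is derived from the raw JSON Lines files of the OpenAlex snapshot, converted into sharded, ZSTD-compressed Parquet files that together contain more than 606.5 million records. It covers seven entity types: works, authors, institutions, journals (sources), topics, publishers, and funders. Each entity type is stored in its own set of Parquet files, with each file capped at one million rows to keep storage and access efficient. Nested fields such as authorships and locations are stored as JSON strings, which preserves their structure for later parsing.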
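Features
The OpenAlex dataset is notable for its breadth and for how densely its entities are linked, which makes it well suited to building scholarly knowledge graphs. It contains 492.4 million works, 113.6 million authors, and 121.5 thousand institutions, among other entities, connected through citation networks, author affiliations, and topic classifications. It provides a four-level topic hierarchy (domain, field, subfield, topic) and includes bibliometric indicators such as open access status, citation counts, and h-index. The data is delivered as analysis-ready Parquet that supports direct querying and streaming, and storing nested fields as JSON keeps the original relationships intact while still allowing efficient computation.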
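Usage
The dataset can be used in several ways for large-scale scholarly data analysis. Researchers can query the Parquet files hosted on Hugging Face directly with DuckDB and run complex SQL analyses without a full download, for example computing yearly open access rates or extracting highly cited works. With the Hugging Face `datasets` library, users can stream a specific entity for memory-efficient processing. Where a local copy is needed, `huggingface_hub` supports selective download by entity type. Nested JSON fields can be unpacked with the `json_extract` function or a standard JSON parser, which enables cross-entity analysis such as joining works to their authors' institutions.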
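Background and challenges
Background overview
In scholarly informetrics, a comprehensive, open, and interconnected academic database is essential for building knowledge graphs of science and for research evaluation. OpenAlex was created in 2022 by the nonprofit OurResearch as an open replacement for the discontinued Microsoft Academic Graph. It brings together the core entities of the global research system (works, authors, institutions, journals, topics, publishers, and funders) and covers more than 250 million scholarly works with their full citation networks, authorship chains, institutional affiliations, and topic classifications. Its central question is how to build a free, open, and interoperable global scholarly knowledge graph that supports research and applications in bibliometrics, the science of science, and open science. It has had a substantial effect on the discoverability and transparency of scholarly resources and on cross-disciplinary analysis.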
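Current challenges
The OpenAlex dataset addresses key challenges in building scholarly knowledge graphs and analyzing scholarly data at scale. At the domain level, the challenge is to link and normalize entities (authors, institutions, works) accurately enough to support reliable citation analysis, impact assessment, and forecasting of scientific trends, while coping with a publishing ecosystem in which data is highly heterogeneous, frequently updated, and subject to changing open access status. At the construction level, the challenge is to extract, clean, and merge very large numbers of records from heterogeneous sources (journal metadata, institutional registries, open repositories) while keeping the data consistent, complete, and current. Converting the large raw JSON Lines files into analysis-ready Parquet efficiently, and handling the storage and querying of nested fields such as authorships and locations, also places heavy demands on the processing pipeline.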
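Common scenarios
Classic use cases
In scientometrics, the classic use of the OpenAlex dataset is the quantitative assessment of scholarly impact. By analyzing citation networks of papers, authors' h-indexes, and institutional output, researchers can map how disciplines develop. The dataset supports in-depth study of cross-field collaboration patterns, open access trends, and the effectiveness of research funding, giving research evaluation a comprehensive and dynamic view.
Academic problems addressed
The OpenAlex dataset addresses the closed and fragmented nature of traditional scholarly databases by giving scientometric research a unified, open data foundation. It makes large-scale citation analysis, detection of interdisciplinary work, and assessment of research productivity feasible, and it helps move research evaluation from simple indicators toward multidimensional network measures. It has also advanced the open science movement and gives policymakers empirical evidence for allocating research resources.
Derived and related work
A number of works build on the OpenAlex dataset, including machine learning models that predict high-impact papers, dynamic discipline classification systems, and analyses of global inequality in research. These studies have deepened understanding of how scientific knowledge is produced and have given rise to scholarly discovery tools such as Litmaps and ResearchRabbit, which continue to drive progress in the science of science and informetrics.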
The content above was collected and summarized by 遇见数据集.