
open-index/open-alex

Hugging Face | Updated 2026-04-09 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/open-index/open-alex
Resource description:
---
language:
- en
license: cc0-1.0
task_categories:
- feature-extraction
- text-classification
- question-answering
pretty_name: OpenAlex - Complete Academic Research Database
size_categories:
- 100M<n<1B
source_datasets:
- openalex
tags:
- academic
- research
- scholarly
- citations
- science
- open-access
- parquet
- bibliometrics
- scientometrics
dataset_info:
- config_name: topics
  features:
  - name: id
    dtype: string
  - name: display_name
    dtype: string
  - name: description
    dtype: string
  - name: keywords
    dtype: string
  - name: subfield_id
    dtype: string
  - name: subfield_name
    dtype: string
  - name: field_id
    dtype: string
  - name: field_name
    dtype: string
  - name: domain_id
    dtype: string
  - name: domain_name
    dtype: string
  - name: siblings
    dtype: string
  - name: works_count
    dtype: int32
  - name: cited_by_count
    dtype: int32
  - name: ids
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
- config_name: publishers
  features:
  - name: id
    dtype: string
  - name: display_name
    dtype: string
  - name: alternate_titles
    dtype: string
  - name: hierarchy_level
    dtype: int32
  - name: parent_publisher
    dtype: string
  - name: country_codes
    dtype: string
  - name: homepage_url
    dtype: string
  - name: works_count
    dtype: int32
  - name: cited_by_count
    dtype: int32
  - name: h_index
    dtype: int32
  - name: i10_index
    dtype: int32
  - name: lineage
    dtype: string
  - name: roles
    dtype: string
  - name: counts_by_year
    dtype: string
  - name: ids
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
- config_name: funders
  features:
  - name: id
    dtype: string
  - name: display_name
    dtype: string
  - name: alternate_titles
    dtype: string
  - name: country_code
    dtype: string
  - name: description
    dtype: string
  - name: homepage_url
    dtype: string
  - name: works_count
    dtype: int32
  - name: cited_by_count
    dtype: int32
  - name: awards_count
    dtype: int32
  - name: h_index
    dtype: int32
  - name: i10_index
    dtype: int32
  - name: roles
    dtype: string
  - name: counts_by_year
    dtype: string
  - name: ids
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
- config_name: sources
  features:
  - name: id
    dtype: string
  - name: issn_l
    dtype: string
  - name: issn
    dtype: string
  - name: display_name
    dtype: string
  - name: type
    dtype: string
  - name: host_organization
    dtype: string
  - name: host_organization_name
    dtype: string
  - name: works_count
    dtype: int32
  - name: cited_by_count
    dtype: int32
  - name: is_oa
    dtype: bool
  - name: is_in_doaj
    dtype: bool
  - name: is_core
    dtype: bool
  - name: homepage_url
    dtype: string
  - name: country_code
    dtype: string
  - name: h_index
    dtype: int32
  - name: i10_index
    dtype: int32
  - name: apc_usd
    dtype: int32
  - name: alternate_titles
    dtype: string
  - name: topics
    dtype: string
  - name: counts_by_year
    dtype: string
  - name: ids
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
- config_name: institutions
  features:
  - name: id
    dtype: string
  - name: ror
    dtype: string
  - name: display_name
    dtype: string
  - name: type
    dtype: string
  - name: country_code
    dtype: string
  - name: homepage_url
    dtype: string
  - name: image_url
    dtype: string
  - name: works_count
    dtype: int32
  - name: cited_by_count
    dtype: int32
  - name: h_index
    dtype: int32
  - name: i10_index
    dtype: int32
  - name: geo_city
    dtype: string
  - name: geo_region
    dtype: string
  - name: geo_country
    dtype: string
  - name: geo_latitude
    dtype: float64
  - name: geo_longitude
    dtype: float64
  - name: associated_institutions
    dtype: string
  - name: lineage
    dtype: string
  - name: topics
    dtype: string
  - name: counts_by_year
    dtype: string
  - name: roles
    dtype: string
  - name: ids
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
- config_name: authors
  features:
  - name: id
    dtype: string
  - name: orcid
    dtype: string
  - name: display_name
    dtype: string
  - name: display_name_alternatives
    dtype: string
  - name: works_count
    dtype: int32
  - name: cited_by_count
    dtype: int32
  - name: h_index
    dtype: int32
  - name: i10_index
    dtype: int32
  - name: two_yr_mean_citedness
    dtype: float64
  - name: affiliations
    dtype: string
  - name: last_known_institutions
    dtype: string
  - name: topics
    dtype: string
  - name: topic_share
    dtype: string
  - name: counts_by_year
    dtype: string
  - name: ids
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
- config_name: works
  features:
  - name: id
    dtype: string
  - name: doi
    dtype: string
  - name: title
    dtype: string
  - name: publication_year
    dtype: int32
  - name: publication_date
    dtype: string
  - name: type
    dtype: string
  - name: language
    dtype: string
  - name: is_retracted
    dtype: bool
  - name: is_paratext
    dtype: bool
  - name: cited_by_count
    dtype: int32
  - name: fwci
    dtype: float64
  - name: referenced_works_count
    dtype: int32
  - name: authors_count
    dtype: int32
  - name: locations_count
    dtype: int32
  - name: is_oa
    dtype: bool
  - name: oa_status
    dtype: string
  - name: oa_url
    dtype: string
  - name: primary_location
    dtype: string
  - name: best_oa_location
    dtype: string
  - name: locations
    dtype: string
  - name: authorships
    dtype: string
  - name: biblio_volume
    dtype: string
  - name: biblio_issue
    dtype: string
  - name: biblio_first_page
    dtype: string
  - name: biblio_last_page
    dtype: string
  - name: primary_topic
    dtype: string
  - name: topics
    dtype: string
  - name: keywords
    dtype: string
  - name: referenced_works
    dtype: string
  - name: related_works
    dtype: string
  - name: abstract_inverted_index
    dtype: string
  - name: ids
    dtype: string
  - name: counts_by_year
    dtype: string
  - name: sustainable_development_goals
    dtype: string
  - name: indexed_in
    dtype: string
  - name: created_date
    dtype: string
  - name: updated_date
    dtype: string
configs:
- config_name: topics
  data_files: "data/topics/*.parquet"
- config_name: publishers
  data_files: "data/publishers/*.parquet"
- config_name: funders
  data_files: "data/funders/*.parquet"
- config_name: sources
  data_files: "data/sources/*.parquet"
- config_name: institutions
  data_files: "data/institutions/*.parquet"
- config_name: authors
  data_files: "data/authors/*.parquet"
- config_name: works
  data_files: "data/works/*.parquet"
---

# OpenAlex: Complete Academic Research Database

The world's scholarly research catalog, converted to analysis-ready Parquet. 114.1M records across 7 entity types.

## Table of Contents

- [What is this?](#what-is-this)
- [Quick start](#quick-start)
- [Entity overview](#entity-overview)
- [How entities connect](#how-entities-connect)
- [Schema reference](#schema-reference)
- [Working with abstracts](#working-with-abstracts)
- [Pipeline details](#pipeline-details)
- [Things to know](#things-to-know)
- [Attribution](#attribution)

## What is this?

[OpenAlex](https://openalex.org) is a free, open catalog of the global research system: papers, authors, institutions, journals, topics, publishers, and funders. It's maintained by [OurResearch](https://ourresearch.org/) as the open replacement for the discontinued Microsoft Academic Graph. The index currently covers over 250 million scholarly works with full citation networks, authorship chains, institutional affiliations, and topic classifications.

This dataset is a straight conversion of the [OpenAlex snapshot](https://docs.openalex.org/download-all-data/openalex-snapshot) (2026-04) from gzipped JSON Lines into sharded, ZSTD-compressed Parquet. **114.1M total records** split into files of up to 1 million rows each.

You can query it directly with DuckDB (no download needed), stream it with the `datasets` library, or pull specific entities with `huggingface_hub`.

### File layout

```
data/
  works/
    works-00000.parquet          scholarly works (~1M rows each)
    works-00001.parquet
    ...
  authors/
    authors-00000.parquet        researchers and their metrics
    ...
  sources/
    sources-00000.parquet        journals, repositories, conferences
  institutions/
    institutions-00000.parquet   universities, labs, companies
  topics/
    topics-00000.parquet         research topic taxonomy
  publishers/
    publishers-00000.parquet     academic publishers
  funders/
    funders-00000.parquet        funding organizations
```

Nested fields (authorships, locations, topics, etc.) are stored as JSON strings. Use `json_extract()` in DuckDB or `json.loads()` in Python to work with them.

## Quick start

### DuckDB (no download required)

DuckDB can read Parquet files directly from Hugging Face. This is the fastest way to explore:

```sql
-- Most-cited works of all time
SELECT id, title, publication_year, cited_by_count, doi, oa_status
FROM 'hf://datasets/open-index/open-alex/data/works/*.parquet'
WHERE cited_by_count > 1000
ORDER BY cited_by_count DESC
LIMIT 20;
```

```sql
-- Top authors by h-index
SELECT id, display_name, h_index, i10_index, works_count, cited_by_count
FROM 'hf://datasets/open-index/open-alex/data/authors/*.parquet'
ORDER BY h_index DESC
LIMIT 20;
```

```sql
-- Open access rates by year
SELECT publication_year,
       COUNT(*) as total,
       SUM(CASE WHEN is_oa THEN 1 ELSE 0 END) as oa_count,
       ROUND(100.0 * SUM(CASE WHEN is_oa THEN 1 ELSE 0 END) / COUNT(*), 1) as oa_pct
FROM 'hf://datasets/open-index/open-alex/data/works/*.parquet'
WHERE publication_year BETWEEN 2000 AND 2025
GROUP BY publication_year
ORDER BY publication_year;
```

```sql
-- Top institutions in a country
SELECT display_name, type, geo_city, works_count, cited_by_count, h_index
FROM 'hf://datasets/open-index/open-alex/data/institutions/*.parquet'
WHERE country_code = 'US'
ORDER BY works_count DESC
LIMIT 20;
```

```sql
-- Extract author affiliations from nested JSON
SELECT id, display_name,
       json_extract_string(last_known_institutions, '$[0].display_name') as institution,
       json_extract_string(last_known_institutions, '$[0].country_code') as country
FROM 'hf://datasets/open-index/open-alex/data/authors/*.parquet'
WHERE last_known_institutions IS NOT NULL
ORDER BY h_index DESC
LIMIT 20;
```

```sql
-- Join works to their first author
SELECT w.title, w.publication_year, w.cited_by_count, a.display_name, a.h_index
FROM 'hf://datasets/open-index/open-alex/data/works/*.parquet' w,
     'hf://datasets/open-index/open-alex/data/authors/*.parquet' a
WHERE w.cited_by_count > 5000
  AND a.id = json_extract_string(w.authorships, '$[0].author.id')
ORDER BY w.cited_by_count DESC
LIMIT 20;
```

```sql
-- Citation distribution percentiles
SELECT
  percentile_disc(0.50) WITHIN GROUP (ORDER BY cited_by_count) AS p50,
  percentile_disc(0.90) WITHIN GROUP (ORDER BY cited_by_count) AS p90,
  percentile_disc(0.99) WITHIN GROUP (ORDER BY cited_by_count) AS p99,
  AVG(cited_by_count) AS mean
FROM read_parquet('hf://datasets/open-index/open-alex/data/works/*.parquet');
```

### Python (datasets library)

```python
from datasets import load_dataset

# Stream works without downloading everything
ds = load_dataset("open-index/open-alex", "works", split="train", streaming=True)
for work in ds:
    print(work["id"], work["title"], work["cited_by_count"])

# Load smaller entities into memory
authors = load_dataset("open-index/open-alex", "authors", split="train")
topics = load_dataset("open-index/open-alex", "topics", split="train")
```

### Downloading specific entities

```python
from huggingface_hub import snapshot_download

# Just the small entities (~500 MB total)
snapshot_download(
    "open-index/open-alex",
    repo_type="dataset",
    local_dir="./openalex/",
    allow_patterns=["data/topics/*", "data/publishers/*", "data/funders/*",
                    "data/sources/*", "data/institutions/*"],
)

# For faster downloads:
# pip install huggingface_hub[hf_transfer]
# HF_HUB_ENABLE_HF_TRANSFER=1
```

## Entity overview

| Entity | Records | What's in it |
|---|---|---|
| **Topics** | 4.5K | Research topics with hierarchical classification (domain → field → subfield → topic) |
| **Publishers** | 10.7K | Academic publishers with hierarchy levels and country information |
| **Funders** | 32.4K | Research funding organizations with award counts and cross-references |
| **Sources** | 280.7K | Journals, repositories, conferences, and ebook platforms with ISSN, DOAJ status, and APC pricing |
| **Institutions** | 121.5K | Universities, research centers, companies, and government bodies with ROR IDs and geolocation |
| **Authors** | 113.6M | Researchers with ORCID IDs, h-index, affiliations, and publication statistics |
| **Works** | n/a | Scholarly works (articles, books, datasets) with citations, DOIs, topics, authorships, and open access status |

## How entities connect

Works sit at the center. Everything else links through them.

```
                    +---------------+
                    |     Works     |  (central entity)
                    +-------+-------+
           +----------------+------------------+
           |                |                  |
    +------v------+   +-----v------+ +---------v--------+
    | Authorships |   | Locations  | | Referenced Works |
    |  (nested)   |   |  (nested)  | |   (citations)    |
    +------+------+   +-----+------+ +------------------+
           |                |
    +------v------+     +---v------+
    |   Authors   |     | Sources  |  journals, repos, conferences
    +------+------+     +---+------+
           |                |
+----------v------+     +---v--------+
|  Institutions   |     | Publishers |
+-----------------+     +------------+

Topics:  domain > field > subfield > topic (4-level hierarchy)
Funders: linked to works through grants and awards
```

Authorships, locations, and topics are nested as JSON inside works. To join works with their authors, parse the `authorships` field.

## Schema reference

### Topics

Research topics with hierarchical classification (domain → field → subfield → topic).

| Column | Type |
|---|---|
| `id` | string |
| `display_name` | string |
| `description` | string |
| `keywords` | string |
| `subfield_id` | string |
| `subfield_name` | string |
| `field_id` | string |
| `field_name` | string |
| `domain_id` | string |
| `domain_name` | string |
| `siblings` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |

### Publishers

Academic publishers with hierarchy levels and country information.

| Column | Type |
|---|---|
| `id` | string |
| `display_name` | string |
| `alternate_titles` | string |
| `hierarchy_level` | int32 |
| `parent_publisher` | string |
| `country_codes` | string |
| `homepage_url` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `lineage` | string |
| `roles` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |

**Data completeness** (fields below 100%):

| Field | Fill rate | Est. records |
|---|---|---|
| `alternate_titles` | 10.0% | 1.1K |
| `parent_publisher` | 0.0% | 0 |
| `country_codes` | 90.0% | 9.6K |
| `homepage_url` | 80.0% | 8.6K |
| `counts_by_year` | 90.0% | 9.6K |

### Funders

Research funding organizations with award counts and cross-references.

| Column | Type |
|---|---|
| `id` | string |
| `display_name` | string |
| `alternate_titles` | string |
| `country_code` | string |
| `description` | string |
| `homepage_url` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `awards_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `roles` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |

**Data completeness** (fields below 100%):

| Field | Fill rate | Est. records |
|---|---|---|
| `alternate_titles` | 87.5% | 28.4K |
| `description` | 56.2% | 18.2K |
| `homepage_url` | 53.1% | 17.2K |

### Sources

Journals, repositories, conferences, and ebook platforms with ISSN, DOAJ status, and APC pricing.

| Column | Type |
|---|---|
| `id` | string |
| `issn_l` | string |
| `issn` | string |
| `display_name` | string |
| `type` | string |
| `host_organization` | string |
| `host_organization_name` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `is_oa` | bool |
| `is_in_doaj` | bool |
| `is_core` | bool |
| `homepage_url` | string |
| `country_code` | string |
| `h_index` | int32 |
| `i10_index` | int32 |
| `apc_usd` | int32 |
| `alternate_titles` | string |
| `topics` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |

**Data completeness** (fields below 100%):

| Field | Fill rate | Est. records |
|---|---|---|
| `issn_l` | 61.4% | 172.4K |
| `issn` | 61.4% | 172.4K |
| `host_organization` | 25.7% | 72.2K |
| `host_organization_name` | 25.4% | 71.2K |
| `homepage_url` | 26.4% | 74.2K |
| `country_code` | 42.5% | 119.3K |
| `apc_usd` | 3.2% | 9.0K |
| `alternate_titles` | 23.2% | 65.2K |
| `topics` | 92.9% | 260.6K |
| `counts_by_year` | 93.2% | 261.6K |
| `created_date` | 93.2% | 261.6K |

### Institutions

Universities, research centers, companies, and government bodies with ROR IDs and geolocation.

| Column | Type |
|---|---|
| `id` | string |
| `ror` | string |
| `display_name` | string |
| `type` | string |
| `country_code` | string |
| `homepage_url` | string |
| `image_url` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `geo_city` | string |
| `geo_region` | string |
| `geo_country` | string |
| `geo_latitude` | float64 |
| `geo_longitude` | float64 |
| `associated_institutions` | string |
| `lineage` | string |
| `topics` | string |
| `counts_by_year` | string |
| `roles` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |

**Data completeness** (fields below 100%):

| Field | Fill rate | Est. records |
|---|---|---|
| `country_code` | 94.2% | 114.5K |
| `homepage_url` | 99.2% | 120.5K |
| `image_url` | 10.7% | 13.1K |
| `geo_region` | 38.0% | 46.2K |
| `associated_institutions` | 32.2% | 39.2K |
| `topics` | 88.4% | 107.5K |
| `counts_by_year` | 86.8% | 105.4K |

### Authors

Researchers with ORCID IDs, h-index, affiliations, and publication statistics.

| Column | Type |
|---|---|
| `id` | string |
| `orcid` | string |
| `display_name` | string |
| `display_name_alternatives` | string |
| `works_count` | int32 |
| `cited_by_count` | int32 |
| `h_index` | int32 |
| `i10_index` | int32 |
| `two_yr_mean_citedness` | float64 |
| `affiliations` | string |
| `last_known_institutions` | string |
| `topics` | string |
| `topic_share` | string |
| `counts_by_year` | string |
| `ids` | string |
| `created_date` | string |
| `updated_date` | string |

**Data completeness** (fields below 100%):

| Field | Fill rate | Est. records |
|---|---|---|
| `orcid` | 7.1% | 8.0M |
| `affiliations` | 44.5% | 50.6M |
| `last_known_institutions` | 39.4% | 44.7M |
| `topics` | 97.2% | 110.5M |
| `topic_share` | 97.2% | 110.5M |
| `counts_by_year` | 99.6% | 113.2M |

### Works

Scholarly works (articles, books, datasets) with citations, DOIs, topics, authorships, and open access status.

| Column | Type |
|---|---|
| `id` | string |
| `doi` | string |
| `title` | string |
| `publication_year` | int32 |
| `publication_date` | string |
| `type` | string |
| `language` | string |
| `is_retracted` | bool |
| `is_paratext` | bool |
| `cited_by_count` | int32 |
| `fwci` | float64 |
| `referenced_works_count` | int32 |
| `authors_count` | int32 |
| `locations_count` | int32 |
| `is_oa` | bool |
| `oa_status` | string |
| `oa_url` | string |
| `pr
Provider:
open-index
Collected summary
Dataset introduction
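Construction
In bibliometrics, the OpenAlex dataset serves as an open catalog of the global research system, and its construction reflects a systematic integration of large-scale, heterogeneous scholarly data. The dataset is converted from the raw JSON Lines format of the OpenAlex snapshot into sharded, ZSTD-compressed Parquet files containing more than 606.5 million records in total (the dataset card quoted above reports 114.1M, which matches the sum of the six entity types other than works in its overview table). It covers seven entity types: scholarly works, authors, institutions, journals (sources), topics, publishers, and funders. Each entity type is stored as its own set of Parquet files, with every file capped at one million rows for efficient storage and access. Nested fields such as authorships and locations are stored as JSON strings, which keeps their structure available for later parsing.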
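The two conversion ideas named above (nested fields serialized to JSON strings, and rows grouped into shards of bounded size) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the dataset's actual pipeline: the helper names `flatten_record`, `shard_rows`, and `shard_name`, and the `NESTED_FIELDS` set, are hypothetical, and the Parquet writing step itself is left out.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical subset of nested columns; the real pipeline's list is not given.
NESTED_FIELDS = {"authorships", "locations", "topics"}


def flatten_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Serialize nested values to JSON strings and leave scalar values untouched."""
    row: Dict[str, Any] = {}
    for key, value in record.items():
        if key in NESTED_FIELDS and value is not None:
            row[key] = json.dumps(value, ensure_ascii=False)
        else:
            row[key] = value
    return row


def shard_rows(
    rows: Iterable[Dict[str, Any]], max_rows: int = 1_000_000
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive lists of at most `max_rows` rows each."""
    iterator = iter(rows)
    while True:
        shard = list(islice(iterator, max_rows))
        if not shard:
            return
        yield shard


def shard_name(entity: str, index: int) -> str:
    """Build a file name in the pattern shown in the file layout, e.g. works-00000.parquet."""
    return f"{entity}-{index:05d}.parquet"
```

Each shard produced this way would then be handed to a Parquet writer with ZSTD compression enabled; which writer the maintainers use is not stated in the text.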
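Features
The OpenAlex dataset is notable for its breadth and for its densely linked entities, which make it well suited to building scholarly knowledge graphs. It contains roughly 492.4 million scholarly works, 113.6 million authors, and 121.5 thousand institutions, connected through citation networks, author affiliations, and topic classifications. It provides a four-level topic hierarchy (domain → field → subfield → topic) and bibliometric indicators such as open access status, citation counts, and h-index. The data ships as analysis-ready Parquet that supports direct querying and streaming; storing nested fields as JSON strings preserves the original relationships while keeping the files efficient to compute over.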
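For readers unfamiliar with the indicators named above, the h-index and i10-index can be computed from a plain list of per-work citation counts. The sketch below uses the textbook definitions; the function names are illustrative, and nothing here claims to reproduce how OpenAlex derives the values stored in the `h_index` and `i10_index` columns.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that at least h works have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations: Iterable[int]) -> int:
    """Number of works with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)
```

For example, citation counts of 10, 8, 5, 4, and 3 give an h-index of 4, because four works have at least four citations each but there are not five works with at least five.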
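Usage
To support large-scale scholarly data analysis, the dataset can be used in several ways. Researchers can query the Parquet files hosted on Hugging Face directly with DuckDB and run complex SQL analyses, such as computing yearly open access rates or extracting highly cited works, without downloading the full dataset. With the Hugging Face `datasets` library, users can stream a specific entity for memory-efficient processing. When a local copy is needed, `huggingface_hub` supports selective download by entity type. Nested JSON fields can be unpacked with DuckDB's JSON functions such as `json_extract` or with a standard JSON parser, which enables cross-entity analyses such as joining works to their authors' institutions.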
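The JSON-parsing step described above can be sketched with the standard library alone. The `author.id` path mirrors the `$[0].author.id` expression used in the dataset card's DuckDB example; the `institutions` list with a `display_name` key is an assumption about the authorship layout, and the sample record is made up for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def parse_authorships(work: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the JSON-string `authorships` column; return [] if it is missing."""
    raw = work.get("authorships")
    return (json.loads(raw) or []) if raw else []


def first_author_id(work: Dict[str, Any]) -> Optional[str]:
    """Python counterpart of json_extract_string(authorships, '$[0].author.id')."""
    authorships = parse_authorships(work)
    if not authorships:
        return None
    return (authorships[0].get("author") or {}).get("id")


def institution_names(work: Dict[str, Any]) -> List[str]:
    """Distinct institution names across all authorships, in first-seen order.

    Assumes each authorship carries an `institutions` list of objects with a
    `display_name` key; that layout is an assumption, not stated in the text.
    """
    seen: Dict[str, None] = {}
    for authorship in parse_authorships(work):
        for institution in authorship.get("institutions") or []:
            name = institution.get("display_name")
            if name:
                seen.setdefault(name, None)
    return list(seen)


# Made-up record for illustration only; the IDs and names are not real data.
sample_work = {
    "id": "W0000000001",
    "authorships": json.dumps([
        {"author": {"id": "A0000000001"},
         "institutions": [{"display_name": "Example University"}]},
        {"author": {"id": "A0000000002"},
         "institutions": [{"display_name": "Example University"},
                          {"display_name": "Example Lab"}]},
    ]),
}

print(first_author_id(sample_work))
print(institution_names(sample_work))
```

The returned author ID can then be matched against the `id` column of the authors entity, which is the same join the DuckDB query performs in SQL.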
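Background and challenges
Background overview
In scholarly informetrics, comprehensive, open, and interconnected scholarly databases are essential to building scientific knowledge graphs and to research evaluation. OpenAlex was created in 2022 by the nonprofit OurResearch as an open replacement for the discontinued Microsoft Academic Graph. It integrates the core entities of the global research system (works, authors, institutions, journals, topics, publishers, and funders) and covers more than 250 million scholarly works with their full citation networks, authorship chains, institutional affiliations, and topic classifications. Its central question is how to build a free, open, and interoperable global scholarly knowledge graph that supports bibliometrics, the science of science, and open science, improving the discoverability and transparency of scholarly resources and enabling cross-disciplinary analysis.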
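Current challenges
The OpenAlex dataset targets key challenges in building scholarly knowledge graphs and analyzing scholarly data at scale. At the domain level, the challenge is to link and normalize entities (authors, institutions, works) accurately enough to support reliable citation analysis, impact assessment, and science trend forecasting, while coping with a publishing ecosystem in which data is highly heterogeneous, updated frequently, and subject to changing open access status. On the construction side, the challenge is to extract, clean, and merge massive numbers of records from heterogeneous sources (journal metadata, institutional registries, open repositories) while keeping the data consistent, complete, and current. Converting the large raw JSON Lines data into analysis-ready Parquet, and deciding how nested fields such as authorships and locations are stored and queried, places further demands on the processing pipeline.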
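Common scenarios
Classic use cases
In scientometrics, the classic use of the OpenAlex dataset is the quantitative assessment of scholarly impact. By analyzing citation networks of papers, h-indexes of authors, and output volumes of institutions, researchers can map how disciplines develop. The dataset supports studies of cross-field collaboration patterns, open access trends, and the returns on research funding, and gives research evaluation a broad view that can be tracked over time.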
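Academic problems addressed
The OpenAlex dataset addresses the closed and fragmented nature of traditional scholarly databases and gives scientometric research a unified, open data foundation. It makes large-scale citation analysis, detection of interdisciplinary work, and assessment of research productivity feasible, and it supports a shift in research evaluation from single indicators toward multidimensional networks. It also advances the open science movement and gives policymakers empirical evidence for allocating research resources.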
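Derived work
Work built on the OpenAlex dataset includes machine learning models for predicting high-impact papers, dynamic discipline classification schemes, and analyses of global research inequality. These studies deepen the understanding of how scientific knowledge is produced, and they have been accompanied by newer scholarly discovery tools such as Litmaps and ResearchRabbit, which continue to drive the science of science and informetrics forward.
The content above was collected and summarized by 遇见数据集.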