J0nasW/science-datalake

Name: J0nasW/science-datalake
Creator: J0nasW
Published: 2026-04-11 07:01:51
License: No description available

Hugging Face · Updated 2026-04-11 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/J0nasW/science-datalake

Resource overview:

---
language:
- en
license:
- cc0-1.0
- cc-by-4.0
- cc-by-sa-4.0
- cc-by-nc-sa-4.0
- cc-by-nc-4.0
size_categories:
- 100M<n<1B
task_categories:
- text-classification
- feature-extraction
tags:
- scholarly
- academic
- citations
- bibliometrics
- science-of-science
- openalex
- sciscinet
- papers-with-code
- duckdb
- parquet
- ontologies
- knowledge-graph
pretty_name: Science Data Lake
thumbnail: https://raw.githubusercontent.com/J0nasW/science-datalake/main/sdl_banner.jpg
configs:
# Cross-reference tables
- config_name: unified_papers
  data_files: "xref/unified_papers/*.parquet"
- config_name: topic_ontology_map
  data_files: "xref/topic_ontology_map/*.parquet"
- config_name: ontology_bridges
  data_files: "xref/ontology_bridges/*.parquet"
# OpenAlex (CC0 1.0)
- config_name: openalex_works
  data_files: "openalex/works/*.parquet"
- config_name: openalex_authors
  data_files: "openalex/authors/*.parquet"
- config_name: openalex_topics
  data_files: "openalex/topics/*.parquet"
- config_name: openalex_works_topics
  data_files: "openalex/works_topics/*.parquet"
- config_name: openalex_works_authorships
  data_files: "openalex/works_authorships/*.parquet"
- config_name: openalex_works_referenced_works
  data_files: "openalex/works_referenced_works/*.parquet"
- config_name: openalex_works_keywords
  data_files: "openalex/works_keywords/*.parquet"
- config_name: openalex_institutions
  data_files: "openalex/institutions/*.parquet"
# SciSciNet (CC BY 4.0)
- config_name: sciscinet_core
  data_files: "sciscinet/core/*.parquet"
- config_name: sciscinet_large
  data_files: "sciscinet/large/*.parquet"
# Papers With Code (CC BY-SA 4.0)
- config_name: pwc_papers
  data_files: "pwc/papers/*.parquet"
- config_name: pwc_paper_has_code
  data_files: "pwc/paper_has_code/*.parquet"
- config_name: pwc_methods
  data_files: "pwc/methods/*.parquet"
- config_name: pwc_paper_has_task
  data_files: "pwc/paper_has_task/*.parquet"
- config_name: pwc_datasets
  data_files: "pwc/datasets/*.parquet"
# Other sources
- config_name: retwatch
  data_files: "retwatch/retraction_watch/*.parquet"
- config_name: p2p_preprint_to_paper
  data_files: "p2p/preprint_to_paper/*.parquet"
# Reliance on Science (CC BY-NC 4.0)
- config_name: ros_patent_paper_pairs
  data_files: "ros/patent_paper_pairs/*.parquet"
- config_name: ros_patent_paper_pairs_plus
  data_files: "ros/patent_paper_pairs_plus/*.parquet"
- config_name: ros_pcs_oa
  data_files: "ros/pcs_oa/*.parquet"
# Ontologies (various licenses, see below)
- config_name: ontology_terms
  data_files: "ontologies/*_terms.parquet"
- config_name: ontology_hierarchy
  data_files: "ontologies/*_hierarchy.parquet"
- config_name: ontology_xrefs
  data_files: "ontologies/*_xrefs.parquet"
---

<p align="center">
  <img src="https://raw.githubusercontent.com/J0nasW/science-datalake/main/sdl_banner.jpg" alt="Science Data Lake" width="100%">
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2603.03126"><img src="https://img.shields.io/badge/arXiv-2603.03126-b31b1b" alt="arXiv"></a>
  <a href="https://github.com/J0nasW/science-datalake"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"></a>
  <a href="https://doi.org/10.57967/hf/7850"><img src="https://img.shields.io/badge/DOI-10.57967%2Fhf%2F7850-blue" alt="DOI"></a>
  <a href="https://github.com/J0nasW/science-datalake/blob/main/SCHEMA.md"><img src="https://img.shields.io/badge/LLM--Ready-SCHEMA.md-purple" alt="LLM-Ready"></a>
  <a href="https://x.com/Jonas_H_W"><img src="https://img.shields.io/badge/Follow-%40Jonas__H__W-black?logo=x" alt="Follow on X"></a>
  <a href="https://wilinski.me"><img src="https://img.shields.io/badge/Author-wilinski.me-orange" alt="Author website"></a>
</p>

# Science Data Lake

A unified, portable science data lake integrating **7 scholarly datasets** (~525 GB Parquet) with cross-dataset DOI normalization, **13 scientific ontologies** (1.3M terms), and a reproducible ETL pipeline.

> **Note:** One additional source (Semantic Scholar S2AG) is supported by the pipeline but is **not redistributed here** due to its API terms of service. See [Not Included in This Upload](#not-included-in-this-upload) below.

## What's Unique

This dataset enables queries that are **impossible with any single source**:

```sql
-- "Top disruptive papers with open-source code, checking for retractions"
SELECT doi, title, year,
       sciscinet_disruption,    -- from SciSciNet
       oa_cited_by_count,       -- from OpenAlex
       has_pwc,                 -- from Papers With Code
       has_retraction           -- from Retraction Watch
FROM unified_papers
WHERE has_pwc AND sciscinet_disruption > 0.5
ORDER BY oa_cited_by_count DESC
LIMIT 20
```

## Datasets Included

| Dataset | Papers/Records | License | Key Contribution |
|---------|---------------|---------|-----------------|
| **OpenAlex** | 479M works | **CC0 1.0** (public domain) | Broadest coverage, topics, FWCI |
| **SciSciNet** v2 | 250M papers | **CC BY 4.0** | Disruption index, atypicality, team size |
| **Papers With Code** | 513K papers | **CC BY-SA 4.0** | Method-task-dataset-code links |
| **Retraction Watch** | 69K records | **Open** (via Crossref) | Retraction flags + reasons |
| **Reliance on Science** | 47.8M pairs | **CC BY-NC 4.0** | Patent-to-paper citation pairs (global) |
| **Preprint-to-Paper** | 146K pairs | **CC BY 4.0** | bioRxiv preprint to published paper |
| **13 Ontologies** | 1.3M terms | Various (see below) | CSO, MeSH, GO, DOID, ChEBI, NCIT, HPO, EDAM, AGROVOC, UNESCO, STW, MSC2020, PhySH |

### Ontology Licenses

| Ontology | License |
|----------|---------|
| MeSH | Public Domain (US government work) |
| GO, ChEBI, NCIT, EDAM, CSO, PhySH, STW | CC BY 4.0 |
| DOID | CC0 1.0 |
| AGROVOC | CC BY 3.0 IGO |
| UNESCO Thesaurus | CC BY-SA 3.0 IGO |
| HPO | Custom (free for research use) |
| MSC2020 | **CC BY-NC-SA 4.0** (non-commercial) |

### Snapshot Dates

Each source was downloaded at a specific point in time:

| Dataset | Snapshot / Release | Notes |
|---------|-------------------|-------|
| OpenAlex | 2026-02-03 | S3 snapshot |
| SciSciNet v2 | 2024-11-01 | GCS bucket |
| Papers With Code | 2025-07 | Archived JSON |
| Retraction Watch | 2025-02 | Crossref CSV |
| Reliance on Science | v64 | Zenodo record |
| Preprint-to-Paper | 2025-06 | Zenodo record |
| 13 Ontologies | 2026-02 | Official sources |

All snapshots can be refreshed using the [update pipeline](https://github.com/J0nasW/science-datalake); see below.

### Not Included in This Upload

The following source is supported by the full pipeline ([GitHub](https://github.com/J0nasW/science-datalake)) but is **not redistributed here** due to its API terms of service:

| Dataset | Reason | How to obtain |
|---------|--------|---------------|
| **S2AG** (Semantic Scholar, 231M papers) | License requires individual agreement with Semantic Scholar | [Semantic Scholar Datasets API](https://api.semanticscholar.org/api-docs/datasets) |

After downloading S2AG locally, run the full pipeline to integrate it.

## Key Tables

### `unified_papers` (293M rows)

The headline table: one row per unique DOI, joining all sources.

| Column | Type | Description |
|--------|------|-------------|
| `doi` | VARCHAR | Normalized DOI (lowercase, no prefix) |
| `title` | VARCHAR | Best available title (OpenAlex > S2AG) |
| `year` | BIGINT | Publication year |
| `openalex_id` | VARCHAR | OpenAlex work ID |
| `sciscinet_paperid` | VARCHAR | SciSciNet paper ID |
| `has_openalex` | BOOLEAN | Present in OpenAlex |
| `has_sciscinet` | BOOLEAN | Present in SciSciNet |
| `has_pwc` | BOOLEAN | Has code on Papers With Code |
| `has_retraction` | BOOLEAN | Flagged in Retraction Watch |
| `has_s2ag` | BOOLEAN | Present in Semantic Scholar |
| `has_patent` | BOOLEAN | Cited by at least one patent (RoS) |
| `s2ag_corpusid` | BIGINT | Semantic Scholar corpus ID |
| `s2ag_citationcount` | INTEGER | S2AG citation count |
| `oa_cited_by_count` | BIGINT | OpenAlex citation count |
| `sciscinet_disruption` | DOUBLE | Disruption index (CD index) |
| `sciscinet_atypicality` | DOUBLE | Atypicality score |
| `oa_fwci` | DOUBLE | Field-Weighted Citation Impact |

> **Note:** The S2AG columns (`s2ag_corpusid`, `s2ag_citationcount`, `s2ag_influentialcitationcount`, `s2ag_isopenaccess`, `has_s2ag`) are present in the uploaded file but will contain NULL/FALSE values unless S2AG has been integrated locally. All other columns (including `has_patent` from Reliance on Science) are fully populated.

### `topic_ontology_map`

Maps OpenAlex's 4,516 topics to terms in 13 scientific ontologies via embedding-based semantic similarity (BGE-large-en-v1.5, 1024-dim) + exact matching for large ontologies (MeSH, ChEBI, NCIT). 16,150 mappings covering 99.8% of topics. Columns include `similarity` (cosine, 0-1) and `match_type` (label/synonym/exact) for quality filtering.

### `ontology_bridges`

Cross-ontology links discovered via shared external IDs (UMLS, Wikidata, MESH, etc.).

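The `doi` column is described as "lowercase, no prefix", so DOIs from your own data usually need the same treatment before joining against `unified_papers`. The helper below is a minimal sketch of that idea. The function name `normalize_doi` and the set of prefixes it strips are assumptions for illustration; the pipeline's actual normalization rules may differ.

```python
import re

# Hypothetical prefix list: resolver URLs and a leading "doi:" label.
# The real pipeline may strip other forms as well.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercase DOI without prefix, or None if it does not look like a DOI."""
    if raw is None:
        return None
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    # Every DOI starts with the "10." directory indicator and has a "/" separator.
    return doi if doi.startswith("10.") and "/" in doi else None


print(normalize_doi("https://doi.org/10.57967/HF/7850"))
```

Applying the same function to both sides of a join keeps the comparison consistent even if the chosen rules are stricter or looser than the ones used upstream.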
## Usage with DuckDB

### Option 1: Pre-built database file (recommended)

This repository includes a ready-to-use DuckDB database file (`datalake.duckdb`, 274 KB) with 145 SQL views pre-configured to read directly from HuggingFace. Download just this one file and query all 7 datasets immediately, with no pipeline setup required.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("ATTACH 'hf://datasets/J0nasW/science-datalake/datalake.duckdb' AS lake")

# Query using familiar schema.table syntax
df = con.execute("""
    SELECT doi, title, year, sciscinet_disruption, oa_cited_by_count
    FROM lake.xref.unified_papers
    WHERE sciscinet_disruption IS NOT NULL
    ORDER BY sciscinet_disruption DESC
    LIMIT 100
""").df()

# Cross-source joins work out of the box
con.execute("""
    SELECT t.display_name AS topic, o.ontology, o.term_name, o.similarity
    FROM lake.xref.topic_ontology_map o
    JOIN lake.openalex.topics t ON t.id = o.topic_id
    WHERE o.similarity >= 0.85
    ORDER BY o.similarity DESC
    LIMIT 20
""").df()
```

### Option 2: Direct Parquet queries

You can also query individual Parquet files directly without the database file:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

df = con.execute("""
    SELECT doi, title, year, sciscinet_disruption, oa_cited_by_count
    FROM 'hf://datasets/J0nasW/science-datalake/xref/unified_papers/*.parquet'
    WHERE sciscinet_disruption IS NOT NULL
    ORDER BY sciscinet_disruption DESC
    LIMIT 100
""").df()
```

## Keeping the Data Current

The full pipeline supports incremental updates. When upstream sources release new snapshots:

```bash
# Update a single dataset
python scripts/datalake_cli.py update openalex

# Update all datasets and rebuild cross-reference tables
python scripts/datalake_cli.py update
python scripts/materialize_unified_papers.py
```

See the [GitHub repository](https://github.com/J0nasW/science-datalake) for full pipeline documentation.

## LLM & AI Agent Integration

This data lake ships with **[SCHEMA.md](https://github.com/J0nasW/science-datalake/blob/main/SCHEMA.md)**, a structured reference file optimized for LLM-based coding agents (Claude Code, Cursor, Copilot, etc.). It contains every table, column, type, join strategy, and performance tier in a format that AI agents can use to write correct DuckDB SQL without prior schema knowledge.

Point your AI assistant at `SCHEMA.md` and ask it to query across all 7 hosted datasets and 13 ontologies using natural language.

## Building the Full Instance (All 8 Sources)

Clone the GitHub repository and run the pipeline to integrate all sources including S2AG:

```bash
git clone https://github.com/J0nasW/science-datalake
cd science-datalake
python scripts/datalake_cli.py download --all
python scripts/datalake_cli.py convert --all
python scripts/create_unified_db.py
python scripts/materialize_unified_papers.py
```

## Citation

If you use the Science Data Lake, please cite the paper:

```bibtex
@article{wilinski2026sciencedatalake,
  title   = {The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment},
  author  = {Wilinski, Jonas},
  journal = {arXiv preprint arXiv:2603.03126},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.03126}
}
```

Dataset DOI: [10.57967/hf/7850](https://doi.org/10.57967/hf/7850)

## License

This dataset aggregates multiple sources, each with its own license.

**Users must comply with the most restrictive license applicable to the sources they use.**

| Component | License |
|-----------|---------|
| Integration code (scripts, pipeline) | MIT |
| OpenAlex data | CC0 1.0 (public domain) |
| SciSciNet v2 data | CC BY 4.0 |
| Papers With Code data | CC BY-SA 4.0 |
| Retraction Watch data | Open (via Crossref) |
| Reliance on Science data | CC BY-NC 4.0 |
| Preprint-to-Paper data | CC BY 4.0 |
| Cross-reference tables (`unified_papers`, `topic_ontology_map`) | Derived work; most restrictive source license applies |
| Ontologies | Various (see table above); note **MSC2020 is CC BY-NC-SA 4.0** |
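
Because the applicable terms depend on which sources a given analysis touches, it can help to collect the obligations per source set. The sketch below is one possible way to do that, not part of the published pipeline. The source keys reuse the directory prefixes from the config list (`pwc`, `ros`, `p2p`, `retwatch`), the function name `obligations` is hypothetical, and the flags reflect only the standard Creative Commons conditions (BY, SA, NC). Ontology licenses and the "Open (via Crossref)" terms are not modelled.

```python
# Licenses copied from the table above; keys are assumed short names.
SOURCE_LICENSE = {
    "openalex": "CC0-1.0",
    "sciscinet": "CC-BY-4.0",
    "pwc": "CC-BY-SA-4.0",
    "retwatch": "Open (via Crossref)",
    "ros": "CC-BY-NC-4.0",
    "p2p": "CC-BY-4.0",
}


def obligations(sources):
    """Summarize which Creative Commons conditions apply to a set of sources."""
    licenses = {SOURCE_LICENSE[s] for s in sources}
    cc_by = {name for name in licenses if name.startswith("CC-BY")}
    return {
        "licenses": sorted(licenses),
        "attribution": bool(cc_by),
        "share_alike": any("-SA" in name for name in cc_by),
        "non_commercial": any("-NC" in name for name in cc_by),
        # Licenses this sketch cannot classify are reported, not guessed.
        "unclassified": sorted(n for n in licenses if not n.startswith("CC")),
    }


print(obligations(["openalex", "pwc", "ros"]))
```

A query that joins OpenAlex, Papers With Code, and Reliance on Science data would, under these assumptions, carry attribution, share-alike, and non-commercial conditions at once.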

Provider:

J0nasW

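Collected summary

Dataset introduction

Construction

The Science Data Lake fuses heterogeneous scholarly data sources in a systematic way. A reproducible ETL pipeline integrates roughly 525 GB of Parquet data from seven core scholarly sources, including OpenAlex, SciSciNet, and Papers With Code. The build normalizes DOIs across datasets and uses semantic similarity computed with the BGE-large-en-v1.5 embedding model to map OpenAlex's 4,516 topics to the 1.3 million terms of 13 scientific ontologies (among them CSO, MeSH, and GO), producing more than 16,000 mappings. Bridge links between ontologies are derived from shared external identifiers such as UMLS and Wikidata. The result is a portable data lake whose core structures include the unified papers table and the topic-ontology mapping table.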
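
The embedding-based mapping step can be illustrated with a toy example. The sketch below assumes topic and term embeddings are already available as plain lists of floats; the three-dimensional vectors, the names `cosine` and `best_matches`, and the 0.85 cut-off (borrowed from the README's example query) are illustrative assumptions, whereas the dataset itself uses 1024-dimensional BGE-large-en-v1.5 embeddings plus exact matching for large ontologies.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matches(topic_vecs, term_vecs, threshold=0.85):
    """Map each topic to its most similar term, keeping matches at or above threshold."""
    matches = {}
    for topic, topic_vec in topic_vecs.items():
        scored = ((term, cosine(topic_vec, vec)) for term, vec in term_vecs.items())
        term, score = max(scored, key=lambda pair: pair[1])
        if score >= threshold:
            matches[topic] = (term, round(score, 3))
    return matches


# Toy vectors standing in for real embeddings.
topics = {"Deep Learning": [0.9, 0.1, 0.0], "Soil Science": [0.0, 0.2, 0.9]}
terms = {
    "neural networks": [1.0, 0.0, 0.0],
    "agronomy": [0.0, 0.0, 1.0],
    "graph theory": [0.0, 1.0, 0.0],
}
print(best_matches(topics, terms))
```

The threshold plays the same role as filtering on the `similarity` column of `topic_ontology_map`: raising it trades coverage for precision.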

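Usage

Researchers can use the data lake in two main ways. The most convenient is the pre-built DuckDB database file: after installing the httpfs extension and attaching the remote file, standard SQL can be used for cross-source queries over all seven datasets and thirteen ontologies. The alternative is to read the Parquet files directly, again through DuckDB. The accompanying SCHEMA.md is written for AI coding agents and documents the tables, column types, and join strategies so that they can generate queries from natural-language instructions. Users who need the Semantic Scholar S2AG data, which is not included, or who want to refresh the data can run the full pipeline scripts from the project's GitHub repository for a local build and incremental updates.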
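
For the direct-Parquet route, the config names in the dataset card map to file globs, so a preview query can be assembled from a config name. The following is a small sketch under stated assumptions: `preview_query` and `run` are hypothetical helpers, `DATA_FILES` copies only a subset of the card's `config_name` to `data_files` mapping, and `run` needs the `duckdb` and `pandas` packages plus network access, which is why it is not called here.

```python
REPO = "hf://datasets/J0nasW/science-datalake"

# Subset of the config_name -> data_files mapping from the dataset card.
DATA_FILES = {
    "unified_papers": "xref/unified_papers/*.parquet",
    "topic_ontology_map": "xref/topic_ontology_map/*.parquet",
    "openalex_topics": "openalex/topics/*.parquet",
    "pwc_methods": "pwc/methods/*.parquet",
}


def preview_query(config, columns=("*",), limit=10):
    """Build a DuckDB SELECT over the Parquet glob of one config."""
    if config not in DATA_FILES:
        raise KeyError(f"unknown config: {config}")
    cols = ", ".join(columns)
    return f"SELECT {cols} FROM '{REPO}/{DATA_FILES[config]}' LIMIT {int(limit)}"


def run(sql):
    """Execute the query remotely; requires duckdb, pandas, and network access."""
    import duckdb  # imported lazily so the builder works without it installed

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    return con.execute(sql).df()


print(preview_query("unified_papers", ("doi", "title", "year"), limit=5))
```

Keeping the query builder separate from execution makes it easy to inspect the SQL before sending a scan over several hundred gigabytes of remote files.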

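Background and challenges

Background

As scientometrics and open science expand, integrating multi-source, heterogeneous scholarly data across fields has become a key bottleneck. The Science Data Lake, built and released by researcher Jonas Wilinski in 2026, uses a unified data architecture to integrate eight core scholarly sources (the seven hosted datasets plus Semantic Scholar S2AG through the pipeline), including OpenAlex, SciSciNet, and Papers With Code, covering about 293 million papers and their metadata. Its central aim is to break down scholarly data silos and enable efficient cross-dataset queries and joint analysis, supporting research on scientific impact, novelty, code availability, and research integrity. Its pre-built cross-reference tables and ontology mappings give science-of-science, bibliometrics, and AI-driven research analysis a shared infrastructure and make large-scale scholarly data analysis considerably more practical.

Current challenges

The dataset addresses the broad scientometric challenge of fusing scholarly knowledge across sources and supporting complex queries over it. The core difficulty is aligning hundreds of millions of scholarly entities (papers, authors, topics) from large, heterogeneous sources and exposing consistent impact and disruption metrics. Construction faces several technical challenges. First, integration must handle differences among sources in license compatibility, identifier normalization (such as DOIs), and snapshot dates. Second, cross-ontology semantic alignment relies on an embedding model (BGE-large-en-v1.5) to map thousands of topics to the million-plus terms of 13 scientific ontologies while keeping mapping quality interpretable. Third, the pipeline must remain reproducible and support incremental updates as upstream sources evolve, which is key to the dataset's long-term usefulness.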

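Common scenarios

Classic use cases

In scientometrics and the science of science, the Science Data Lake serves as a model for cross-source integration of scholarly information. Its most typical use is multi-dimensional analysis of scholarly impact and novelty. In a single query, researchers can relate a paper's disruption index, citation count, code availability, and retraction record across datasets, and so identify high-impact, reproducible breakthrough work. This integrated querying makes it possible to study the dynamics of scientific development in depth.

Academic problems addressed

The dataset tackles the long-standing problems of data silos and standardization in scientometric research. By integrating seven heterogeneous scholarly resources such as OpenAlex and SciSciNet, supplemented by thirteen scientific ontologies, it builds a unified, interoperable scholarly knowledge graph. Researchers can systematically analyze the disruptiveness of scientific innovation, patterns of team collaboration, the path from basic research to patented technology, and the broader effects of research misconduct, which extends the depth and breadth of empirical work in the science of science.

Practical applications

In practice, the Science Data Lake provides data infrastructure for research evaluation, science and technology policy, and scholarly information services. Institutions can use it to assess the overall impact and innovation quality of research teams. Policymakers can analyze the scientific basis and patent dependencies of specific technology areas such as artificial intelligence or biomedicine. Scholarly search engines and recommender systems can draw on its metadata and ontology links to offer more precise literature discovery and knowledge navigation.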

Recent research using the dataset