
GodotCN/science-datalake

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/GodotCN/science-datalake
Resource description:
---
language:
- en
license:
- cc0-1.0
- cc-by-4.0
- cc-by-sa-4.0
- cc-by-nc-sa-4.0
- cc-by-nc-4.0
size_categories:
- 100M<n<1B
task_categories:
- text-classification
- feature-extraction
tags:
- scholarly
- academic
- citations
- bibliometrics
- science-of-science
- openalex
- sciscinet
- papers-with-code
- duckdb
- parquet
- ontologies
- knowledge-graph
pretty_name: Science Data Lake
thumbnail: https://raw.githubusercontent.com/J0nasW/science-datalake/main/sdl_banner.jpg
configs:
# Cross-reference tables
- config_name: unified_papers
  data_files: "xref/unified_papers/*.parquet"
- config_name: topic_ontology_map
  data_files: "xref/topic_ontology_map/*.parquet"
- config_name: ontology_bridges
  data_files: "xref/ontology_bridges/*.parquet"
# OpenAlex (CC0 1.0)
- config_name: openalex_works
  data_files: "openalex/works/*.parquet"
- config_name: openalex_authors
  data_files: "openalex/authors/*.parquet"
- config_name: openalex_topics
  data_files: "openalex/topics/*.parquet"
- config_name: openalex_works_topics
  data_files: "openalex/works_topics/*.parquet"
- config_name: openalex_works_authorships
  data_files: "openalex/works_authorships/*.parquet"
- config_name: openalex_works_referenced_works
  data_files: "openalex/works_referenced_works/*.parquet"
- config_name: openalex_works_keywords
  data_files: "openalex/works_keywords/*.parquet"
- config_name: openalex_institutions
  data_files: "openalex/institutions/*.parquet"
# SciSciNet (CC BY 4.0)
- config_name: sciscinet_core
  data_files: "sciscinet/core/*.parquet"
- config_name: sciscinet_large
  data_files: "sciscinet/large/*.parquet"
# Papers With Code (CC BY-SA 4.0)
- config_name: pwc_papers
  data_files: "pwc/papers/*.parquet"
- config_name: pwc_paper_has_code
  data_files: "pwc/paper_has_code/*.parquet"
- config_name: pwc_methods
  data_files: "pwc/methods/*.parquet"
- config_name: pwc_paper_has_task
  data_files: "pwc/paper_has_task/*.parquet"
- config_name: pwc_datasets
  data_files: "pwc/datasets/*.parquet"
# Other sources
- config_name: retwatch
  data_files: "retwatch/retraction_watch/*.parquet"
- config_name: p2p_preprint_to_paper
  data_files: "p2p/preprint_to_paper/*.parquet"
# Reliance on Science (CC BY-NC 4.0)
- config_name: ros_patent_paper_pairs
  data_files: "ros/patent_paper_pairs/*.parquet"
- config_name: ros_patent_paper_pairs_plus
  data_files: "ros/patent_paper_pairs_plus/*.parquet"
- config_name: ros_pcs_oa
  data_files: "ros/pcs_oa/*.parquet"
# Ontologies (various licenses, see below)
- config_name: ontology_terms
  data_files: "ontologies/*_terms.parquet"
- config_name: ontology_hierarchy
  data_files: "ontologies/*_hierarchy.parquet"
- config_name: ontology_xrefs
  data_files: "ontologies/*_xrefs.parquet"
---

<p align="center">
  <img src="https://raw.githubusercontent.com/J0nasW/science-datalake/main/sdl_banner.jpg" alt="Science Data Lake" width="100%">
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2603.03126"><img src="https://img.shields.io/badge/arXiv-2603.03126-b31b1b" alt="arXiv"></a>
  <a href="https://github.com/J0nasW/science-datalake"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"></a>
  <a href="https://doi.org/10.57967/hf/7850"><img src="https://img.shields.io/badge/DOI-10.57967%2Fhf%2F7850-blue" alt="DOI"></a>
  <a href="https://github.com/J0nasW/science-datalake/blob/main/SCHEMA.md"><img src="https://img.shields.io/badge/LLM--Ready-SCHEMA.md-purple" alt="LLM-Ready"></a>
  <a href="https://x.com/Jonas_H_W"><img src="https://img.shields.io/badge/Follow-%40Jonas__H__W-black?logo=x" alt="Follow on X"></a>
  <a href="https://wilinski.me"><img src="https://img.shields.io/badge/Author-wilinski.me-orange" alt="Author website"></a>
</p>

# Science Data Lake

A unified, portable science data lake integrating **7 scholarly datasets** (~525 GB Parquet) with cross-dataset DOI normalization, **13 scientific ontologies** (1.3M terms), and a reproducible ETL pipeline.

> **Note:** One additional source (Semantic Scholar S2AG) is supported by the pipeline but is **not redistributed here** due to its API terms of service. See [Not Included in This Upload](#not-included-in-this-upload) below.

## What's Unique

This dataset enables queries that are **impossible with any single source**:

```sql
-- "Top disruptive papers with open-source code, checking for retractions"
SELECT doi, title, year,
       sciscinet_disruption,   -- from SciSciNet
       oa_cited_by_count,      -- from OpenAlex
       has_pwc,                -- from Papers With Code
       has_retraction          -- from Retraction Watch
FROM unified_papers
WHERE has_pwc AND sciscinet_disruption > 0.5
ORDER BY oa_cited_by_count DESC
LIMIT 20
```

## Datasets Included

| Dataset | Papers/Records | License | Key Contribution |
|---------|---------------|---------|-----------------|
| **OpenAlex** | 479M works | **CC0 1.0** (public domain) | Broadest coverage, topics, FWCI |
| **SciSciNet** v2 | 250M papers | **CC BY 4.0** | Disruption index, atypicality, team size |
| **Papers With Code** | 513K papers | **CC BY-SA 4.0** | Method-task-dataset-code links |
| **Retraction Watch** | 69K records | **Open** (via Crossref) | Retraction flags + reasons |
| **Reliance on Science** | 47.8M pairs | **CC BY-NC 4.0** | Patent-to-paper citation pairs (global) |
| **Preprint-to-Paper** | 146K pairs | **CC BY 4.0** | bioRxiv preprint to published paper |
| **13 Ontologies** | 1.3M terms | Various (see below) | CSO, MeSH, GO, DOID, ChEBI, NCIT, HPO, EDAM, AGROVOC, UNESCO, STW, MSC2020, PhySH |

### Ontology Licenses

| Ontology | License |
|----------|---------|
| MeSH | Public Domain (US government work) |
| GO, ChEBI, NCIT, EDAM, CSO, PhySH, STW | CC BY 4.0 |
| DOID | CC0 1.0 |
| AGROVOC | CC BY 3.0 IGO |
| UNESCO Thesaurus | CC BY-SA 3.0 IGO |
| HPO | Custom (free for research use) |
| MSC2020 | **CC BY-NC-SA 4.0** (non-commercial) |

### Snapshot Dates

Each source was downloaded at a specific point in time:

| Dataset | Snapshot / Release | Notes |
|---------|-------------------|-------|
| OpenAlex | 2026-02-03 | S3 snapshot |
| SciSciNet v2 | 2024-11-01 | GCS bucket |
| Papers With Code | 2025-07 | Archived JSON |
| Retraction Watch | 2025-02 | Crossref CSV |
| Reliance on Science | v64 | Zenodo record |
| Preprint-to-Paper | 2025-06 | Zenodo record |
| 13 Ontologies | 2026-02 | Official sources |

All snapshots can be refreshed using the [update pipeline](https://github.com/J0nasW/science-datalake) (see "Keeping the Data Current" below).

### Not Included in This Upload

The following source is supported by the full pipeline ([GitHub](https://github.com/J0nasW/science-datalake)) but is **not redistributed here** due to its API terms of service:

| Dataset | Reason | How to obtain |
|---------|--------|---------------|
| **S2AG** (Semantic Scholar, 231M papers) | License requires individual agreement with Semantic Scholar | [Semantic Scholar Datasets API](https://api.semanticscholar.org/api-docs/datasets) |

After downloading S2AG locally, run the full pipeline to integrate it.

## Key Tables

### `unified_papers` (293M rows)

The headline table: one row per unique DOI, joining all sources.
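
Because `unified_papers` keys every row on a DOI, the same DOI has to look identical in every source before the join. The pipeline's actual normalization rules are not reproduced in this README; the following is only a minimal sketch that assumes "normalized" means trimmed, lowercased, and stripped of common resolver prefixes. The function name `normalize_doi` and the prefix list are illustrative assumptions, not code from the repository.

```python
from typing import Optional

# Assumed prefix forms; the real pipeline may handle more (or fewer) variants.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Return a lowercase, prefix-free DOI, or None if the input is empty."""
    if raw is None:
        return None
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            # Drop the first matching prefix and any whitespace after it.
            doi = doi[len(prefix):].strip()
            break
    return doi or None


# Hypothetical inputs that should collapse to the same join key.
print(normalize_doi("https://doi.org/10.1000/ABC.123"))
print(normalize_doi("doi:10.1000/abc.123"))
```

Keys produced this way can be compared with plain string equality, which is what a DOI-based join across sources relies on.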

| Column | Type | Description |
|--------|------|-------------|
| `doi` | VARCHAR | Normalized DOI (lowercase, no prefix) |
| `title` | VARCHAR | Best available title (OpenAlex > S2AG) |
| `year` | BIGINT | Publication year |
| `openalex_id` | VARCHAR | OpenAlex work ID |
| `sciscinet_paperid` | VARCHAR | SciSciNet paper ID |
| `has_openalex` | BOOLEAN | Present in OpenAlex |
| `has_sciscinet` | BOOLEAN | Present in SciSciNet |
| `has_pwc` | BOOLEAN | Has code on Papers With Code |
| `has_retraction` | BOOLEAN | Flagged in Retraction Watch |
| `has_s2ag` | BOOLEAN | Present in Semantic Scholar |
| `has_patent` | BOOLEAN | Cited by at least one patent (RoS) |
| `s2ag_corpusid` | BIGINT | Semantic Scholar corpus ID |
| `s2ag_citationcount` | INTEGER | S2AG citation count |
| `oa_cited_by_count` | BIGINT | OpenAlex citation count |
| `sciscinet_disruption` | DOUBLE | Disruption index (CD index) |
| `sciscinet_atypicality` | DOUBLE | Atypicality score |
| `oa_fwci` | DOUBLE | Field-Weighted Citation Impact |

> **Note:** The S2AG columns (`s2ag_corpusid`, `s2ag_citationcount`, `s2ag_influentialcitationcount`, `s2ag_isopenaccess`, `has_s2ag`) are present in the uploaded file but will contain NULL/FALSE values unless S2AG has been integrated locally. All other columns (including `has_patent` from Reliance on Science) are fully populated.

### `topic_ontology_map`

Maps OpenAlex's 4,516 topics to terms in 13 scientific ontologies via embedding-based semantic similarity (BGE-large-en-v1.5, 1024-dim) + exact matching for large ontologies (MeSH, ChEBI, NCIT). 16,150 mappings covering 99.8% of topics. Columns include `similarity` (cosine, 0-1) and `match_type` (label/synonym/exact) for quality filtering.

### `ontology_bridges`

Cross-ontology links discovered via shared external IDs (UMLS, Wikidata, MESH, etc.).
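
The idea behind `ontology_bridges`, linking terms from different ontologies that point at the same external identifier, can be sketched in a few lines. This is an illustration only: the record layout (`ontology`, `term_id`, `xref_db`, `xref_id`), the function name `find_bridges`, and the sample identifiers are all assumptions, not the pipeline's actual schema or data.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, NamedTuple, Set, Tuple


class Xref(NamedTuple):
    ontology: str  # e.g. "MeSH"
    term_id: str   # term identifier within that ontology
    xref_db: str   # external database, e.g. "UMLS" or "Wikidata"
    xref_id: str   # identifier within the external database


Term = Tuple[str, str]          # (ontology, term_id)
Bridge = Tuple[Term, Term, str]  # (term_a, term_b, shared external ID)


def find_bridges(xrefs: Iterable[Xref]) -> Set[Bridge]:
    """Pair up terms from different ontologies that share an external ID."""
    by_external = defaultdict(set)
    for x in xrefs:
        by_external[(x.xref_db, x.xref_id)].add((x.ontology, x.term_id))

    bridges: Set[Bridge] = set()
    for (db, ext_id), terms in by_external.items():
        for a, b in combinations(sorted(terms), 2):
            if a[0] != b[0]:  # keep only cross-ontology pairs
                bridges.add((a, b, f"{db}:{ext_id}"))
    return bridges


# Hypothetical records: two ontologies reference the same made-up UMLS concept.
sample = [
    Xref("MeSH", "D1", "UMLS", "C1"),
    Xref("NCIT", "N1", "UMLS", "C1"),
    Xref("HPO", "H1", "Wikidata", "Q1"),
]
for bridge in sorted(find_bridges(sample)):
    print(bridge)
```

Grouping by the external ID first keeps the pairing step local to each group, so terms are only compared with others that already share an identifier.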

## Usage with DuckDB

### Option 1: Pre-built database file (recommended)

This repository includes a ready-to-use DuckDB database file (`datalake.duckdb`, 274 KB) with 145 SQL views pre-configured to read directly from HuggingFace. Download just this one file and query all 7 datasets immediately; no pipeline setup is required.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("ATTACH 'hf://datasets/J0nasW/science-datalake/datalake.duckdb' AS lake")

# Query using familiar schema.table syntax
df = con.execute("""
    SELECT doi, title, year, sciscinet_disruption, oa_cited_by_count
    FROM lake.xref.unified_papers
    WHERE sciscinet_disruption IS NOT NULL
    ORDER BY sciscinet_disruption DESC
    LIMIT 100
""").df()

# Cross-source joins work out of the box
topics_df = con.execute("""
    SELECT t.display_name AS topic, o.ontology, o.term_name, o.similarity
    FROM lake.xref.topic_ontology_map o
    JOIN lake.openalex.topics t ON t.id = o.topic_id
    WHERE o.similarity >= 0.85
    ORDER BY o.similarity DESC
    LIMIT 20
""").df()
```

### Option 2: Direct Parquet queries

You can also query individual Parquet files directly without the database file:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

df = con.execute("""
    SELECT doi, title, year, sciscinet_disruption, oa_cited_by_count
    FROM 'hf://datasets/J0nasW/science-datalake/xref/unified_papers/*.parquet'
    WHERE sciscinet_disruption IS NOT NULL
    ORDER BY sciscinet_disruption DESC
    LIMIT 100
""").df()
```

## Keeping the Data Current

The full pipeline supports incremental updates. When upstream sources release new snapshots:

```bash
# Update a single dataset
python scripts/datalake_cli.py update openalex

# Update all datasets and rebuild cross-reference tables
python scripts/datalake_cli.py update
python scripts/materialize_unified_papers.py
```

See the [GitHub repository](https://github.com/J0nasW/science-datalake) for full pipeline documentation.
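
Returning to the direct Parquet queries of Option 2: they all follow one path pattern, `hf://datasets/<repo>/<data_files glob>`, where the glob is a `data_files` entry from the `configs` list in the dataset card. A small helper can build the query string for any table. This is a sketch; `parquet_glob` and `count_query` are hypothetical names, and only the repository ID and globs that appear in this README are assumed to exist.

```python
REPO_ID = "J0nasW/science-datalake"  # repository ID as used in the examples above


def parquet_glob(data_files: str, repo_id: str = REPO_ID) -> str:
    """Build the hf:// path for one `data_files` glob from the dataset card."""
    return f"hf://datasets/{repo_id}/{data_files.lstrip('/')}"


def count_query(data_files: str, repo_id: str = REPO_ID) -> str:
    """Return a SQL string that counts the rows behind one config."""
    return f"SELECT COUNT(*) AS n FROM '{parquet_glob(data_files, repo_id)}'"


# Globs copied from the `configs` section of the dataset card.
print(count_query("pwc/methods/*.parquet"))
print(count_query("retwatch/retraction_watch/*.parquet"))
```

The resulting string can be passed to `con.execute(...)` on a connection that has `httpfs` loaded, as in the Option 2 example.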

## LLM & AI Agent Integration

This data lake ships with **[SCHEMA.md](https://github.com/J0nasW/science-datalake/blob/main/SCHEMA.md)**, a structured reference file optimized for LLM-based coding agents (Claude Code, Cursor, Copilot, etc.). It contains every table, column, type, join strategy, and performance tier in a format that AI agents can use to write correct DuckDB SQL without prior schema knowledge.

Point your AI assistant at `SCHEMA.md` and ask it to query across all 7 hosted datasets and 13 ontologies using natural language.

## Building the Full Instance (All 8 Sources)

Clone the GitHub repository and run the pipeline to integrate all sources including S2AG:

```bash
git clone https://github.com/J0nasW/science-datalake
cd science-datalake
python scripts/datalake_cli.py download --all
python scripts/datalake_cli.py convert --all
python scripts/create_unified_db.py
python scripts/materialize_unified_papers.py
```

## Citation

If you use the Science Data Lake, please cite the paper:

```bibtex
@article{wilinski2026sciencedatalake,
  title   = {The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment},
  author  = {Wilinski, Jonas},
  journal = {arXiv preprint arXiv:2603.03126},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.03126}
}
```

Dataset DOI: [10.57967/hf/7850](https://doi.org/10.57967/hf/7850)

## License

This dataset aggregates multiple sources, each with its own license.
**Users must comply with the most restrictive license applicable to the sources they use.**

| Component | License |
|-----------|---------|
| Integration code (scripts, pipeline) | MIT |
| OpenAlex data | CC0 1.0 (public domain) |
| SciSciNet v2 data | CC BY 4.0 |
| Papers With Code data | CC BY-SA 4.0 |
| Retraction Watch data | Open (via Crossref) |
| Reliance on Science data | CC BY-NC 4.0 |
| Preprint-to-Paper data | CC BY 4.0 |
| Cross-reference tables (`unified_papers`, `topic_ontology_map`) | Derived work; the most restrictive source license applies |
| Ontologies | Various (see the Ontology Licenses table above); note **MSC2020 is CC BY-NC-SA 4.0** |
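
When combining sources, it can help to list which licenses apply before publishing derived results. The sketch below copies the license labels from the table above into a dictionary, returns the licenses for a chosen set of sources, and flags whether any of them carries a non-commercial clause. The names `SOURCE_LICENSES` and `licenses_for`, the short source keys, and the `-NC` substring check are illustrative simplifications and are not part of the pipeline.

```python
from typing import Iterable, Set, Tuple

# License labels copied from the table above; the short keys are assumptions.
SOURCE_LICENSES = {
    "openalex": "CC0 1.0",
    "sciscinet": "CC BY 4.0",
    "pwc": "CC BY-SA 4.0",
    "retwatch": "Open (via Crossref)",
    "ros": "CC BY-NC 4.0",
    "p2p": "CC BY 4.0",
}


def licenses_for(sources: Iterable[str]) -> Tuple[Set[str], bool]:
    """Return the set of licenses for `sources` and a non-commercial flag."""
    sources = list(sources)
    unknown = [s for s in sources if s not in SOURCE_LICENSES]
    if unknown:
        raise KeyError(f"No license recorded for: {unknown}")
    licenses = {SOURCE_LICENSES[s] for s in sources}
    # Simplified check: Creative Commons non-commercial variants contain "-NC".
    noncommercial = any("-NC" in lic for lic in licenses)
    return licenses, noncommercial


licenses, noncommercial = licenses_for(["openalex", "pwc", "ros"])
print(sorted(licenses), "non-commercial:", noncommercial)
```

A result with the non-commercial flag set signals that the combined output inherits at least one non-commercial restriction, which is the case whenever Reliance on Science data is involved.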
Provider:
GodotCN