vivamed/Bio-Atlas

Name: vivamed/Bio-Atlas
Creator: vivamed
Published: 2025-12-01 19:41:20
License: No description provided

Hugging Face · Updated 2025-12-01 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/vivamed/Bio-Atlas

Resource description:

---
license: cc-by-sa-3.0
task_categories:
- text-classification
- question-answering
language:
- en
tags:
- biology
- drug-discovery
- genomics
- biomedical
- knowledge-graph
- sql
size_categories:
- 100M<n<1B
---

# 🧬 BioAtlas: The Most Comprehensive Biomedical Knowledge Base

**490+ million rows integrating genetics, expression, perturbations, drugs, and disease into one harmonized, queryable database.**

[![License](https://img.shields.io/badge/License-CC%20BY--SA%203.0-blue.svg)](https://creativecommons.org/licenses/by-sa/3.0/) [![Scale](https://img.shields.io/badge/Scale-490M%20Rows-red)]() [![Sources](https://img.shields.io/badge/Sources-40%2B-blue)]() [![Colocalization](https://img.shields.io/badge/Coloc-80M-orange)]()

---

## 📖 Introduction: Solving the Biomedical Data Integration Crisis

Modern biomedical research produces incredible datasets, but they're **isolated silos**:

- GWAS Catalog has genetics, but no gene expression
- LINCS has drug perturbations, but no genetic evidence
- ChEMBL has drugs, but no patient data or genomics
- Open Targets connects some pieces, but is missing perturbation data

**The result?** Researchers spend months manually linking databases just to ask: *"Which drugs target genetically validated disease genes AND have good safety profiles?"*

### BioAtlas Solves This

**One PostgreSQL database. 40+ sources integrated. 490M+ rows. Unified IDs. Harmonized coordinates. Cross-platform normalized.**

Query across genetics + expression + perturbations + drugs + safety in **one SQL statement**. What takes months with raw databases takes seconds with BioAtlas.

---

## 🏆 Complete Coverage: Drugs, Diseases, Pathways, Mechanisms

**BioAtlas provides EVERYTHING about drugs, diseases, pathways, and mechanisms. Competitors give you fragments.**

### Drug Coverage (COMPLETE)

| What You Need | BioAtlas | Clarivate | STRING | Open Targets |
|---|---|---|---|---|
| **Drug Information** | ✅ 10,380 drugs | ⚠️ Limited | ❌ None | ⚠️ Limited |
| **Drug Identifiers** | ✅ 27,927 (ChEMBL, PubChem) | ⚠️ Some | ❌ None | ⚠️ Some |
| **Drug-Target** | ✅ 24,987 (11 sources merged) | ✅ Yes | ❌ None | ✅ Yes |
| **Binding Affinities** | ✅ 1,353,181 (pChEMBL unified) | ⚠️ Limited | ❌ None | ⚠️ Some |
| **Drug-Disease** | ✅ 47,642 indications | ✅ Yes | ❌ None | ✅ Yes |
| **Adverse Events** | ✅ 1,445,327 (FDA FAERS) | ✅ Yes | ❌ None | ⚠️ Limited |
| **Drug Perturbations** | ✅ 720K LINCS + 1.5M Tahoe | ⚠️ Limited | ❌ None | ❌ None |
| **Drug Screening** | ✅ 8.6M combo + 3.8M PRISM | ❌ None | ❌ None | ❌ None |
| **Activity Scores** | ✅ 204M TF/pathway | ❌ None | ❌ None | ❌ None |

**BioAtlas:** COMPLETE drug knowledge (mechanisms, targets, safety, perturbations, screening)

---

### Disease Coverage (COMPLETE)

| What You Need | BioAtlas | Clarivate | STRING | Open Targets |
|---|---|---|---|---|
| **Disease Ontology** | ✅ 30,259 (MONDO) | ✅ Yes | ❌ None | ✅ Yes |
| **Disease Definitions** | ✅ 22,205 detailed | ⚠️ Some | ❌ None | ⚠️ Some |
| **Disease Cross-Refs** | ✅ 104,423 (ICD, OMIM, etc) | ✅ Yes | ❌ None | ✅ Yes |
| **Gene-Disease** | ✅ 1,141,217 associations | ✅ Yes | ❌ None | ✅ Yes |
| **Disease-Phenotype** | ✅ 272,246 (HPO) | ⚠️ Limited | ❌ None | ✅ Yes |
| **Genetic Variants** | ✅ 299,791 curated | ⚠️ Limited | ❌ None | ✅ Yes |
| **GWAS Associations** | ✅ 112,823 variant-disease | ⚠️ Limited | ❌ None | ✅ Yes |
| **Colocalization** | ✅ 80M causal genes | ❌ None | ❌ None | ⚠️ Few |

**BioAtlas:** COMPLETE disease knowledge (genetics, phenotypes, variants, colocalization)

---

### Pathway & Mechanism Coverage (COMPLETE)

| What You Need | BioAtlas | Clarivate | STRING | Open Targets |
|---|---|---|---|---|
| **Pathway Definitions** | ✅ 2,781 (Reactome) | ✅ Yes | ❌ None | ✅ Yes |
| **Pathway Members** | ✅ 136,614 gene-pathway | ✅ Yes | ❌ None | ✅ Yes |
| **Pathway Footprints** | ✅ 252,769 (PROGENy) | ❌ None | ❌ None | ❌ None |
| **TF Regulation** | ✅ 31,953 edges (DoRothEA) | ⚠️ Limited | ❌ None | ⚠️ Limited |
| **Ligand-Receptor** | ✅ 20,911,544 pairs | ⚠️ Limited | ❌ None | ⚠️ Some |
| **Signaling** | ✅ 29,581 (OmniPath) | ✅ Yes | ⚠️ PPI | ⚠️ Limited |
| **GO Annotations** | ✅ 343,641 | ✅ Yes | ✅ Yes | ✅ Yes |
| **TF/Pathway Activities** | ✅ 204M scores | ❌ None | ❌ None | ❌ None |

**BioAtlas:** COMPLETE mechanism knowledge (pathways, TFs, L-R, signaling, activities)

---

### Integration & Normalization (UNIQUE)

| Feature | BioAtlas | Everyone Else |
|---|---|---|
| **All-in-One Database** | ✅ SQL joins across 40+ sources | ❌ Separate downloads |
| **ID Harmonization** | ✅ Ensembl↔ChEMBL↔MONDO | ⚠️ Manual mapping |
| **DMSO Normalization** | ✅ Multi-level cascade | ❌ Raw data |
| **pChEMBL Scale** | ✅ 1.35M unified | ⚠️ Mixed units |
| **GRCh38 Coordinates** | ✅ All harmonized | ⚠️ Mixed builds |
| **Cross-Platform Activities** | ✅ 204M comparable | ❌ Platform-specific |
| **Multi-Source Merges** | ✅ Adverse events, L-R, targets | ❌ Pick one source |
| **Colocalization** | ✅ 80M precomputed | ❌ Compute yourself |
| **Local SQL** | ✅ Full access | ⚠️ APIs/Web only |
| **Price** | **FREE** | $100K-$200K/year |

---

### The Bottom Line:

**For DRUGS:** BioAtlas has targets (25K), affinities (1.35M), indications (47K), safety (1.45M), perturbations (2.2M), screening (12M) - **ALL in one place**

**For DISEASES:** BioAtlas has ontology (30K), genetics (1.14M associations), phenotypes (272K), variants (299K), colocalization (80M) - **COMPLETE coverage**

**For PATHWAYS:** BioAtlas has definitions (2.7K), members (137K), footprints (253K), TF regulation (32K), activities (204M) - **FULL mechanism knowledge**

**For INTEGRATION:** BioAtlas is the ONLY one that connects all of these in one queryable database with advanced normalization

**Competitors:** Have 1-2 of these. BioAtlas has **ALL**.

---

## ⚡ Quick Start: Two Ways to Access

### Option A: 🐍 Python/Pandas (Quick Exploration)

**No database required!** Load individual tables directly with pandas:

```python
import pandas as pd

# Load any table directly from HuggingFace
drugs = pd.read_parquet("hf://datasets/vivamed/Bio-Atlas/parquet/drug.parquet")
diseases = pd.read_parquet("hf://datasets/vivamed/Bio-Atlas/parquet/disease.parquet")
drug_targets = pd.read_parquet("hf://datasets/vivamed/Bio-Atlas/parquet/drug_target.parquet")

# Load the flagship 80M colocalization data
coloc = pd.read_parquet("hf://datasets/vivamed/Bio-Atlas/parquet/coloc_bayesian.parquet")

print(f"Drugs: {len(drugs):,}")            # 10,380
print(f"Diseases: {len(diseases):,}")      # 30,259
print(f"Colocalization: {len(coloc):,}")   # 79,926,756
```

**Available Parquet Files (35 tables, 116M+ rows):**

| Category | Tables | Key Files |
|----------|--------|-----------|
| **Core Entities** | 4 | `drug.parquet`, `disease.parquet`, `gene.parquet`, `pathway.parquet` |
| **Drug Data** | 7 | `drug_target.parquet`, `drug_adverse_event.parquet`, `drug_disease.parquet` |
| **Genetics** | 4 | `variant.parquet`, `variant_disease.parquet`, `coloc_bayesian.parquet` (80M!) |
| **Perturbations** | 5 | `l1000_signature.parquet`, `tahoe_activity.parquet`, `prism_response.parquet` |
| **Networks** | 5 | `lr_interaction.parquet` (21M), `tf_regulation.parquet`, `pathway_member.parquet` |
| **Ontologies** | 5 | `go_term.parquet`, `hpo_term.parquet`, `disease_hpo.parquet` |
| **Screening** | 3 | `drugcomb_summary.parquet`, `depmap_gene_dependency.parquet` |

---

### Option B: 🐘 PostgreSQL (Full Power)

**For complex multi-table queries and the complete 490M+ row database.**

#### Prerequisites

- PostgreSQL 14+ installed
- ~30 GB free disk space
- ~8 GB RAM minimum

#### 1. Download Files (Total ~26 GB)

```bash
# --repo-type dataset targets the dataset repo; --local-dir . saves into the current directory

# Core Knowledge Graph (14.2 GB) - Drugs, diseases, pathways, networks
huggingface-cli download vivamed/Bio-Atlas bioatlas_public_v1.0.dump --repo-type dataset --local-dir .

# LINCS Activity Scores (5.1 GB) - 202M perturbation scores
huggingface-cli download vivamed/Bio-Atlas bio_kg_v1.0.dump --repo-type dataset --local-dir .

# Colocalization Data (6.5 GB) - 80M GWAS colocalization tests
huggingface-cli download vivamed/Bio-Atlas coloc_bayesian.dump --repo-type dataset --local-dir .
```

#### 2. Load Database

```bash
# Create database
createdb bioatlas

# 1. Load Core Tables (SQL format) ~15 mins
psql -d bioatlas -f bioatlas_public_v1.0.dump

# 2. Load LINCS Activities (Binary format) ~10 mins
pg_restore -d bioatlas bio_kg_v1.0.dump

# 3. Load Colocalization (Binary format) ~10 mins
pg_restore -d bioatlas coloc_bayesian.dump
```

#### 3. Verify Installation

```bash
# List tables (should see 79 tables)
psql -d bioatlas -c "\dt"

# Should be 10,380
psql -d bioatlas -c "SELECT COUNT(*) FROM drug;"

# Should be 202,282,258
psql -d bioatlas -c "SELECT COUNT(*) FROM l1000_activity;"

# Should be 79,926,756
psql -d bioatlas -c "SELECT COUNT(*) FROM coloc_bayesian;"
```

---

## 🌍 Scope: The Breadth of Integration

### What We Integrated (40+ Databases)

**Genetics (2.5+ Billion Variant Associations):**
- GWAS Catalog: 443,634 trait studies
- UK Biobank: Full summary statistics
- **80 million colocalization tests** linking variants to causal genes
- GTEx: 415K tissue-specific eQTLs
- eQTL Catalogue: 90K multi-tissue eQTLs
- Open Targets: 1.14M gene-disease associations with L2G scores

**Gene Expression (85+ Million Measurements):**
- **LINCS L1000:** 57.6M gene measurements (720K signatures, 33,609 compounds)
- **CELLxGENE:** 28.2M single-cell measurements (714 cell types)
- **GTEx:** 2M tissue expression measurements
- **Tahoe-100M:** 1.5M activity scores from 100M cells (379 drugs, 50 cell lines)

**Drug Perturbations (204+ Million Activity Scores):**
- **LINCS L1000:** 202M TF/pathway activity scores
- **Tahoe-100M:** 1.55M activity scores (DMSO-normalized where available)
- **SC CRISPR:** 1.6M activities from 208K gene knockouts (4 studies)
- **Cross-platform harmonized** using same TF/pathway networks

**Drug Discovery (25K+ Interactions):**
- **ChEMBL:** 10K drugs, mechanisms, indications
- **BindingDB:** 1.35M binding affinities → pChEMBL unified scale
- **Drug-Target:** 25K interactions from 11 sources merged
- **Adverse Events:** 1.45M records (FDA FAERS)
- **Clinical Indications:** 47K drug-disease pairs

**Drug Screening (12+ Million Experiments):**
- **DrugComb:** 8.6M combination experiments (RAW data)
- **PRISM:** 3.8M drug sensitivity measurements
- **DepMap:** 1.3M CRISPR gene dependencies (essentiality)

**Interactions (21+ Million Edges):**
- **Ligand-Receptor:** 20.9M pairs (CellPhoneDB + CellChatDB)
- **Protein Signaling:** 29K interactions (OmniPath, commercial-filtered)
- **TF Regulation:** 32K edges (DoRothEA, commercial-filtered)
- **Pathway Membership:** 137K gene-pathway links

**Ontologies (Complete):**
- MONDO (30K diseases), HPO (19K phenotypes)
- GO (48K terms, 344K annotations)
- Cell Ontology (714 types), UBERON (78 tissues)
- 5,249 EFO-MONDO cross-references

---

## 📏 Scale: The Numbers

### Total Data Volume

- **490+ million documented rows**
- **79 core tables** (public release)
- **40+ data sources** harmonized
- **~26 GB compressed** (14.2 + 5.1 + 6.5 GB)

### Integration Achievements

- **43,608 genes** harmonized across Ensembl/HGNC/UniProt/Entrez
- **30,259 diseases** unified via MONDO ontology
- **10,380 drugs** linked across ChEMBL/Broad/PubChem
- **1.35M potencies** normalized to pChEMBL scale
- **All coordinates** standardized to GRCh38

### Unique at Scale

- **79,926,756 colocalization tests** (GWAS × eQTL Bayesian analysis)
- **63,332,122 with strong evidence** (H4 > 0.8) - identifies causal genes
- **204.5M activity scores** across 3 platforms (comparable framework)
- **1.45M adverse events** from FDA FAERS
- **20.9M ligand-receptor pairs** from 2 databases merged

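The pChEMBL scale mentioned above is the negative base-10 logarithm of a potency expressed in molar units. The sketch below illustrates that conversion for a few concentration units. The function name `to_pchembl` and the `UNIT_TO_MOLAR` table are hypothetical helpers written for this illustration, not part of the BioAtlas pipeline, and mass-based units such as mg/mL are left out because they would also need a molecular weight.

```python
import math

# Multipliers from common concentration units to molar (mol/L).
# Hypothetical lookup table for illustration; extend as needed.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_pchembl(value: float, unit: str) -> float:
    """Convert a potency (Ki, Kd, IC50, EC50) to -log10(molar concentration)."""
    if value <= 0:
        raise ValueError("potency must be positive")
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"unsupported unit: {unit}")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


# Two measurements reported in different units land on one comparable scale.
ki_value = to_pchembl(10, "nM")   # 10 nM  = 1e-8 M
kd_value = to_pchembl(1, "uM")    # 1 uM   = 1e-6 M
print(round(ki_value, 2), round(kd_value, 2))
```

A higher value means a more potent interaction, so the 10 nM measurement (about 8) ranks above the 1 µM measurement (about 6) regardless of the units in which each was originally reported.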

---

## 🔧 Normalization: The Cleaning We Did

### 1. DMSO Cascade Normalization (Tahoe-100M)

**Problem:** Plate effects, batch effects, vehicle toxicity confound drug signals

**Our Multi-Level Solution:**

```
Cascade Matching Strategy:

Priority 1: Exact Match (~60% of scores)
  - Same plate + cell line + time + feature
  - drug_score - dmso_score = perfect batch correction
  - Highest confidence scores

Priority 2: Plate Match (Additional contexts)
  - Same plate + feature (different cell/time)
  - drug_score - mean(plate_dmso)
  - Corrects major plate effects

Priority 3: Raw Score (~40% of scores)
  - No matching DMSO available
  - Raw score with NO_CONTROL flag
  - Still biologically meaningful, documented limitation
```

**Quality Flags:** Every score tagged with control availability

**User Choice:** Filter to normalized scores for highest confidence, or use all with documented caveats

### 2. pChEMBL Potency Unification

**Problem:** Binding data chaos across databases

**Before:**
- Ki in nM from ChEMBL
- Kd in μM from BindingDB
- IC50 in various units
- EC50 in mg/mL
- **Incomparable!**

**After:**

```
All → pChEMBL = -log10(molar)

Measurement Quality Ranking:
Ki (equilibrium) > Kd (dissociation) > IC50 (functional) > EC50 (functional)

Result: 1,353,181 directly comparable potencies
```

### 3. GWAS Harmonization to GRCh38

**Problem:** Different builds, coordinates, alleles across 443K studies

**Our Pipeline:**

```
1. Coordinate liftover to GRCh38
2. Variant ID standardization: chr:pos:ref:alt
3. Allele harmonization (consistent effect alleles)
4. QC filters:
   - MAF > 0.01
   - INFO > 0.8
   - p < 5×10⁻⁸
5. Ready for colocalization
```

**Result:** Clean, queryable, colocalizable genetic data

### 4. Cross-Platform Activity Normalization

**Problem:** LINCS (978 genes), Tahoe (62K genes), CRISPR (variable) - how to compare?

**Solution:** Apply same networks (DoRothEA + PROGENy) to all platforms

| Platform | Input Genes | TF Activities | Pathway Activities | Comparability |
|---|---|---|---|---|
| LINCS L1000 | 978 | 202M scores | 10M scores | ✅ Same TFs/pathways |
| Tahoe-100M | 62,710 | 1.2M scores | 350K scores | ✅ Same TFs/pathways |
| SC CRISPR | 8-23K | 1.6M scores | - | ✅ Same TFs/pathways |

**Result:** TP53 activity in LINCS comparable to TP53 in Tahoe - cross-platform validation!

---

## 💡 Core Innovation: 99.6% Dimensional Reduction

**After integration, we solve the interpretation problem:**

### The Gene Expression Challenge

- 60,000+ genes measured per experiment
- 99% unchanged or noise
- Biologically overwhelming

### Our Network-Based Solution

```
Gene Expression (60,710 dimensions) - Noisy, uninterpretable
    ↓
Filter to network genes (~6,000 relevant)
    ↓
Apply DoRothEA (31,953 TF-target edges, 242 TFs)
    ↓
TF Activities (242 dimensions) - Regulatory interpretation
    ↓
Apply PROGENy (252,769 weights, 14 pathways)
    ↓
Pathway Activities (14 dimensions) - Mechanism interpretation
```

**Reduction:** 60,710 → 256 features (99.6%)

**Method:** ULM (Univariate Linear Model)

**Formula:** `activity = Σ(expression × sign) / √n_targets`

**Example - Paclitaxel:**
- **Before:** 60,710 gene changes (noise!)
- **After:** Trail pathway +2.8 (apoptosis), TP53 +1.9 (tumor suppressor)
- **Insight:** Mechanism-of-action clear!

---

## 🔥 What Only BioAtlas Has

### 1. 80 Million Colocalization Tests

**GWAS identifies variant-disease associations, but which gene is causal?**

**BioAtlas answer:** 79,926,756 Bayesian colocalization tests (GWAS × eQTL)
- 63,332,122 with strong causal evidence (H4 > 0.8)
- 70% have very strong evidence (H4 > 0.9)
- Precomputed (weeks of analysis)
- Instant causal gene queries

**Example Query:**

```sql
-- Find genes causally linked to Type 2 Diabetes
SELECT DISTINCT
    c."rightStudyLocusId" as gene_study,
    c.h4 as probability_causal,
    c.chromosome
FROM coloc_bayesian c
WHERE c."leftStudyLocusId" LIKE '%diabetes%'
  AND c.h4 > 0.9
ORDER BY c.h4 DESC
LIMIT 10;
```

### 2. Multi-Source Merges (Not Separate Downloads)

**Adverse Events:** FDA_FAERS = 1.45M (comprehensive)

**Ligand-Receptor:** CellPhoneDB + CellChatDB = 20.9M (comprehensive)

**Drug-Target:** ChEMBL + 10 sources = 25K (evidence-weighted)

### 3. Cross-Platform Harmonization

**Same TF/pathway framework across all perturbation datasets:**
- Compare LINCS drug → Tahoe drug → CRISPR gene knockout
- Validate mechanisms across platforms
- 204.5M scores using consistent networks

### 4. Complete Local SQL Access

**No APIs, no rate limits, no internet required**
- Download once, query forever
- Full PostgreSQL power
- Multi-hop joins across all 40+ sources

---

## 💻 Usage Examples

### Example 1: Find Drugs Targeting Genetically Validated Genes

```sql
-- Find FDA-approved drugs targeting genes with strong
-- genetic evidence for Alzheimer's disease
SELECT DISTINCT
    d.drug_name,
    d.max_phase,
    g.hgnc_symbol,
    dg.association_score,
    dt.action_type
FROM drug d
JOIN drug_target dt ON d.molecule_chembl_id = dt.molecule_chembl_id
JOIN gene g ON dt.ensembl_gene_id = g.ensembl_gene_id
JOIN disease_gene dg ON g.ensembl_gene_id = dg.ensembl_gene_id
JOIN disease dis ON dg.mondo_id = dis.mondo_id
WHERE dis.disease_label ILIKE '%Alzheimer%'
  AND d.fda_approved_us = TRUE
  AND dg.association_score > 0.5
ORDER BY dg.association_score DESC;
```

### Example 2: Drug Mechanism-of-Action Analysis

```sql
-- What pathways does Paclitaxel activate across cell lines?
SELECT
    tc.cell_line_name,
    ta.feature_id as pathway,
    AVG(ta.score) as mean_activity,
    COUNT(*) as n_measurements
FROM tahoe_context tc
JOIN tahoe_activity ta ON tc.context_id = ta.context_id
WHERE tc.drug_name = 'Paclitaxel'
  AND ta.provider = 'PROGENy'
  AND ta.feature_type = 'pathway'
GROUP BY tc.cell_line_name, ta.feature_id
HAVING ABS(AVG(ta.score)) > 2.0
ORDER BY ABS(AVG(ta.score)) DESC;
```

### Example 3: Colocalization-Based Drug Repurposing

```sql
-- Find drugs targeting genes with strong colocalization
-- evidence for inflammatory bowel disease
SELECT DISTINCT
    d.drug_name,
    g.hgnc_symbol,
    c.h4 as colocalization_prob,
    dt.action_type
FROM coloc_bayesian c
JOIN gene g ON c."rightStudyLocusId" LIKE '%' || g.ensembl_gene_id || '%'
JOIN drug_target dt ON g.ensembl_gene_id = dt.ensembl_gene_id
JOIN drug d ON dt.molecule_chembl_id = d.molecule_chembl_id
WHERE c."leftStudyLocusId" LIKE '%inflammatory_bowel%'
  AND c.h4 > 0.9
  AND d.max_phase >= 2
LIMIT 20;
```

### Example 4: Cross-Platform Validation

```sql
-- Compare drug effects in LINCS vs Tahoe for the same compound
SELECT
    'LINCS' as platform,
    l.pert_iname as drug_name,
    COUNT(*) as signatures
FROM l1000_signature l
WHERE l.molecule_chembl_id = 'CHEMBL1201259' -- Imatinib
GROUP BY l.pert_iname

UNION ALL

SELECT
    'Tahoe',
    tc.drug_name,
    COUNT(DISTINCT tc.context_id)
FROM tahoe_context tc
WHERE tc.molecule_chembl_id = 'CHEMBL1201259'
GROUP BY tc.drug_name;
```

---

## 🔧 Advanced Features

### 1. Multi-Hop Graph Queries

BioAtlas enables complex multi-hop queries that would require downloading and linking dozens of separate databases:

```sql
-- Find drugs that:
-- 1. Target genes with colocalization evidence
-- 2. Have favorable safety profiles
-- 3. Are in clinical trials
SELECT
    d.drug_name,
    d.max_phase,
    g.hgnc_symbol,
    c.h4 as genetic_evidence,
    COUNT(DISTINCT dae.meddra_pt) as adverse_event_count
FROM drug d
JOIN drug_target dt ON d.molecule_chembl_id = dt.molecule_chembl_id
JOIN gene g ON dt.ensembl_gene_id = g.ensembl_gene_id
JOIN coloc_bayesian c ON c."rightStudyLocusId" LIKE '%' || g.ensembl_gene_id || '%'
LEFT JOIN drug_adverse_event dae ON d.molecule_chembl_id = dae.molecule_chembl_id
WHERE c.h4 > 0.8
  AND d.max_phase >= 2
GROUP BY d.drug_name, d.max_phase, g.hgnc_symbol, c.h4
HAVING COUNT(DISTINCT dae.meddra_pt) < 10 -- Good safety profile
ORDER BY c.h4 DESC
LIMIT 20;
```

### 2. Pathway Enrichment Analysis

```sql
-- Which pathways are enriched in genes associated with Type 2 Diabetes?
SELECT
    p.pathway_label,
    COUNT(DISTINCT pm.ensembl_gene_id) as n_genes,
    AVG(dg.association_score) as avg_association
FROM disease dis
JOIN disease_gene dg ON dis.mondo_id = dg.mondo_id
JOIN pathway_member pm ON dg.ensembl_gene_id = pm.ensembl_gene_id
JOIN pathway p ON pm.pathway_id = p.pathway_id
WHERE dis.disease_label ILIKE '%diabetes%type 2%'
GROUP BY p.pathway_label
HAVING COUNT(DISTINCT pm.ensembl_gene_id) >= 5
ORDER BY COUNT(DISTINCT pm.ensembl_gene_id) DESC;
```

### 3. Drug Combination Synergy Analysis

```sql
-- Find synergistic drug combinations from DrugComb
SELECT
    dc.drug_row,
    dc.drug_col,
    dc.cell_line_name,
    dc.mean_synergy_zip,
    dc.mean_synergy_bliss
FROM drugcomb_experiment dc
WHERE dc.mean_synergy_zip > 10 -- Strong synergy
  AND dc.num_dose_points >= 16 -- Well-powered experiment
ORDER BY dc.mean_synergy_zip DESC
LIMIT 20;
```

---

## 📊 Data Dictionary

See [DATA_DICTIONARY.md](DATA_DICTIONARY.md) for complete table schemas and column descriptions.

**Quick Reference:**

| Category | Tables | Key Tables |
|---|---|---|
| **Core Entities** | 6 | drug, disease, gene, pathway |
| **Drug Data** | 9 | drug_target, drug_adverse_event, drug_disease |
| **Disease Data** | 3 | disease_gene, disease_hpo |
| **Colocalization** | 2 | coloc_bayesian, coloc_ecaviar |
| **LINCS L1000** | 2 | l1000_signature, l1000_activity |
| **Tahoe (scRNA)** | 3 | tahoe_activity, tahoe_context |
| **SC CRISPR** | 4 | sc_activity, sc_perturbation |
| **Drug Screening** | 9 | drugcomb_full_response, prism_response |
| **Networks** | 7 | lr_interaction, pathway_member, tf_regulation |
| **Ontologies** | 11 | hpo_term, go_term |
| **Variants/eQTLs** | 7 | variant, variant_disease, gtex_v10_eqtl_leads |
| **Gene Data** | 5 | gene_celltype_expression |

**Total:** 79 tables, 490M+ rows

---

## 🔥 What Only BioAtlas Has

### 1. 80 Million Colocalization Tests

**GWAS identifies regions, but colocalization identifies genes.**

- **79,926,756** Bayesian tests (GWAS × eQTL/pQTL/sQTL)
- **63,332,122** with strong causal evidence (H4 > 0.8)
- **70%** have very strong evidence (H4 > 0.9)
- Covers all 23 chromosomes
- Links 443K GWAS studies × 1.3M molecular QTL studies

**Why this matters:** A GWAS hit tells you "this region is associated with disease." Colocalization tells you "this specific gene is probably causal." That's the difference between a lead and a validated target.

### 2. Dimensionality Reduction (99.6%)

Gene expression is noisy (60,000 genes, 99% unchanged). We reduce it to **pathway activities**:

- **DoRothEA:** 242 Transcription Factors
- **PROGENy:** 14 Signaling Pathways

**Result:** Interpretable mechanism-of-action profiles for every drug.

**Example:**

```
Raw data: Gene X +1.2, Gene Y -0.8, Gene Z +0.3, ... (60,000 genes)
Reduced:  MAPK pathway +3.2, TP53 activity +2.1, Trail pathway +1.8
```

### 3. Cross-Platform Activity Harmonization

**Same TF/pathway scores across all perturbation platforms:**

- **LINCS L1000:** 202M activity scores
- **Tahoe-100M:** 1.55M activity scores
- **SC CRISPR:** 1.6M activity scores

**Impact:** You can directly compare a drug's MAPK activation in LINCS to a gene knockout's MAPK effect in CRISPR. Cross-platform validation increases confidence 10x.

### 4. Multi-Source Integration

**Competitors force you to pick ONE source. We merged ALL sources:**

| Feature | Sources Merged | Result |
|---|---|---|
| Drug-Target | 11 sources | 24,987 interactions (evidence-weighted) |
| Adverse Events | FDA FAERS | 1.45M events (comprehensive) |
| Ligand-Receptor | CellPhoneDB + CellChatDB | 20.9M pairs (complete) |

---

## 📜 License & Attribution

**License:** [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)

### What You Can Do:

- ✅ Use for commercial drug discovery
- ✅ Build AI/ML models
- ✅ Create products and sell them
- ✅ Modify and redistribute the database

### What You Must Do:

- ⚠️ Provide attribution to BioAtlas
- ⚠️ Share derivative databases under CC BY-SA 3.0
- ⚠️ Include license notice

### Key Data Sources:

- **Open Targets Genetics:** Colocalization data (Apache 2.0)
- **ChEMBL:** Drug data (CC BY-SA 3.0)
- **LINCS L1000:** Perturbation data (CC BY 4.0)
- **FDA FAERS:** Adverse events (Public Domain)
- **Reactome/GO/MONDO:** Ontologies (CC BY 4.0)
- **DepMap, PRISM, DrugComb:** Screening data (CC BY 4.0)
- **GTEx, eQTL Catalogue:** eQTL data (Open Access)

*See [LICENSES_AND_ATTRIBUTIONS.md](LICENSES_AND_ATTRIBUTIONS.md) for complete details.*

---

## 📖 Citation

If you use BioAtlas in your research, please cite:

```bibtex
@dataset{bioatlas_2025,
  title={BioAtlas: Integrated Multi-Omic Biomedical Knowledge Base},
  author={Harris, Nicholas and Vivamed and EveryCure},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/vivamed/Bio-Atlas}
}
```

---

## 📚 Additional Documentation

- **[DATA_DICTIONARY.md](DATA_DICTIONARY.md)** - Complete table schemas and column descriptions
- **[RESEARCH_ARTICLE.md](RESEARCH_ARTICLE.md)** - Technical methodology and processing details
- **[schema.json](schema.json)** - Machine-readable schema for programmatic access
- **[LICENSES_AND_ATTRIBUTIONS.md](LICENSES_AND_ATTRIBUTIONS.md)** - Complete attribution for all 40+ sources

---

## 🤝 Support & Contact

- **Issues?** Open an issue on this repository
- **Questions?** Contact: [Your Contact Info]
- **Commercial Use?** Fully allowed under CC BY-SA 3.0

---

## 🎉 Acknowledgments

BioAtlas integrates data from 40+ publicly available databases. We thank all the researchers, institutions, and consortia who make their data openly available:

- EMBL-EBI (ChEMBL, Open Targets)
- Broad Institute (LINCS, DepMap, PRISM)
- NHGRI-EBI (GWAS Catalog)
- GTEx Consortium
- Gene Ontology Consortium
- Monarch Initiative (MONDO, HPO)
- And many more...

**Your work makes open science possible.** 🙏

---

**BioAtlas: The most comprehensive biomedical knowledge base, freely available to accelerate drug discovery and precision medicine.** 🧬
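
As a closing illustration of the DMSO cascade described under Normalization, the sketch below walks one drug score through the three priorities: exact control match, plate-level control mean, then raw score. It is a minimal sketch under stated assumptions, not the BioAtlas implementation. The row fields (`plate`, `cell_line`, `time`, `feature`, `score`), the function names, and the `EXACT_MATCH` and `PLATE_MATCH` flag strings are hypothetical; only `NO_CONTROL` is named in the description above.

```python
from collections import defaultdict
from statistics import mean


def build_control_indexes(dmso_rows):
    """Index DMSO control scores by exact context and by plate + feature."""
    exact = defaultdict(list)
    plate = defaultdict(list)
    for row in dmso_rows:
        exact[(row["plate"], row["cell_line"], row["time"], row["feature"])].append(row["score"])
        plate[(row["plate"], row["feature"])].append(row["score"])
    return exact, plate


def normalize(drug_row, exact, plate):
    """Return (score, flag) using the first cascade level that has a control."""
    exact_key = (drug_row["plate"], drug_row["cell_line"], drug_row["time"], drug_row["feature"])
    if exact_key in exact:
        # Priority 1: same plate + cell line + time + feature
        return drug_row["score"] - mean(exact[exact_key]), "EXACT_MATCH"
    plate_key = (drug_row["plate"], drug_row["feature"])
    if plate_key in plate:
        # Priority 2: subtract the mean DMSO score for this plate and feature
        return drug_row["score"] - mean(plate[plate_key]), "PLATE_MATCH"
    # Priority 3: no control available, keep the raw score and flag it
    return drug_row["score"], "NO_CONTROL"


# Synthetic controls and drug rows, invented for this example only.
dmso_rows = [
    {"plate": "P1", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 0.5},
    {"plate": "P1", "cell_line": "HeLa", "time": 24, "feature": "TP53", "score": 1.5},
]
exact_idx, plate_idx = build_control_indexes(dmso_rows)

drug_rows = [
    {"plate": "P1", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 2.0},
    {"plate": "P1", "cell_line": "MCF7", "time": 24, "feature": "TP53", "score": 2.0},
    {"plate": "P2", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 2.0},
]
results = [normalize(row, exact_idx, plate_idx) for row in drug_rows]
for row, (score, flag) in zip(drug_rows, results):
    print(row["plate"], row["cell_line"], score, flag)
```

Returning the flag alongside the score mirrors the quality-flag idea: a downstream filter can keep only control-matched rows or use everything with the caveat attached.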
