
jang1563/evo2-spaceflight-vep

Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/jang1563/evo2-spaceflight-vep
Resource description:
---
license: mit
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- biology
- genomics
- variant-effect-prediction
- evo2
- spaceflight
- radiation-biology
- ClinVar
- deep-mutational-scanning
- non-coding-variants
pretty_name: "Evo2 Spaceflight Gene Variant Effect Scores"
size_categories:
- 100K<n<1M
---

# Evo2 Zero-Shot VEP Scores for Spaceflight Radiation-Response Genes

Pre-computed zero-shot variant effect prediction scores from the [Evo2](https://github.com/ARC-Institute/evo2) genomic foundation model (7B parameters) for **215,001 variants** across 10 spaceflight radiation-response genes.

**Code:** [github.com/jang1563/evo2-spaceflight-vep](https://github.com/jang1563/evo2-spaceflight-vep)

## Dataset Description

Each row is a single variant (SNV or indel) scored by Evo2 using an 8,192 bp context window with reverse-complement averaging.

### Columns

| Column | Type | Description |
|--------|------|-------------|
| `variant_key` | string | `chrom:pos:ref>alt` identifier |
| `gene` | string | Gene symbol (BRCA1, TP53, CHEK2, DNMT3A, TERT, ATM, NFE2L2, CLOCK, MSTN, RAD51) |
| `chrom` | string | Chromosome (GRCh38) |
| `pos` | int | 0-based genomic position |
| `ref` | string | Reference allele |
| `alt` | string | Alternate allele |
| `region_type` | string | Functional region: `coding`, `intronic`, `promoter`, `utr5`, `utr3`, `intergenic`, or `radiation_del_mh1`--`radiation_del_mh4` (microhomology-flanked deletion regions) |
| `delta` | float | Evo2 delta score: `score(alt) - score(ref)`. **More negative = more damaging.** |
| `ref_score` | float | Evo2 mean log-likelihood for reference sequence |
| `alt_score` | float | Evo2 mean log-likelihood for alternate sequence |
| `clinvar_class` | string | ClinVar classification (`P/LP`, `B/LB`, `VUS`, `Conflicting_classifications_of_pathogenicity`, etc.) or empty |
| `clinvar_stars` | int | ClinVar review star level (0--4) or empty |

### Gene Summary

| Gene | Variants | ClinVar AUROC | Key Role |
|------|----------|---------------|----------|
| ATM | 52,791 | 0.995 | DSB sensor for radiation damage |
| BRCA1 | 36,901 | 0.988 | DNA double-strand break repair |
| DNMT3A | 19,693 | 0.996 | DNA methyltransferase; astronaut CH |
| CLOCK | 18,369 | -- | Circadian rhythm disruption in space |
| TERT | 17,443 | 0.909 | Telomere biology; NASA Twins Study |
| CHEK2 | 16,098 | 1.000 | DNA damage checkpoint kinase |
| NFE2L2 | 15,230 | -- | Antioxidant response to radiation |
| TP53 | 14,240 | 0.989 | Tumor suppressor; astronaut CH |
| MSTN | 12,383 | -- | Muscle wasting in microgravity |
| RAD51 | 11,853 | -- | Homologous recombination repair |

## Usage

```python
from datasets import load_dataset
import pandas as pd

# Load dataset
ds = load_dataset("jang1563/evo2-spaceflight-vep", split="train")
df = ds.to_pandas()

# Filter to a specific gene
atm = df[df["gene"] == "ATM"]

# Most damaging variants (most negative delta)
damaging = df.nsmallest(100, "delta")

# ClinVar pathogenic variants
pathogenic = df[df["clinvar_class"] == "P/LP"]

# Radiation-type mutations (C>A / G>T, characteristic of 8-oxoguanine)
radiation = df[
    ((df["ref"] == "C") & (df["alt"] == "A")) |
    ((df["ref"] == "G") & (df["alt"] == "T"))
]
```

## Scoring Method

```
score(S) = (1/N) * SUM log P(s_{t+1} | s_1, ..., s_t)
delta = score(S_alt) - score(S_ref)
```

- Model: Evo2 7B (`evo2_7b`, bfloat16)
- Window: 8,192 bp (optimal via ablation study)
- Reverse-complement averaging enabled
- Reference genome: GRCh38

## Validation

- **Mean AUROC 0.980** across 6 ClinVar-validatable genes (vs CADD 0.997, AlphaMissense 0.969)
- Evo2 achieves **highest AUROC** for CHEK2 (0.9996) and TP53 (0.989)
- DMS calibration: Spearman |rho| = 0.23--0.52 across 4 control genes (TP53 rho is negative due to LOF assay directionality)
- Non-coding: rho = -0.267 (p = 3.8e-14) vs TERT MPRA
- Radiation-type mutations more damaging across all 10 genes (all p < 0.005)

## Intended Use

These scores provide **computational supporting evidence** (PP3/BP4 level under ACMG/AMP guidelines) for variant prioritization. They are **NOT standalone diagnostic classifications**. Use in conjunction with clinical evaluation, functional data, and other tools.

### Limitations

- Scores reflect evolutionary sequence constraint, not direct functional measurement
- Non-coding predictions are exploratory for genes without MPRA validation
- 7B model only (40B requires >48 GB VRAM)
- ClinVar has known European ancestry overrepresentation

## Citation

```bibtex
@software{kim2026evo2vep,
  title={Evo2 Zero-Shot Variant Effect Prediction for Spaceflight Genes},
  author={Kim, JangKeun and Mason, Christopher E.},
  year={2026},
  url={https://github.com/jang1563/evo2-spaceflight-vep}
}
```

## References

- Nguyen et al. (2026). Sequence modeling and design from molecular to genome scale with Evo 2. *Nature*. [doi:10.1038/s41586-026-10176-5](https://doi.org/10.1038/s41586-026-10176-5)
- Rutter et al. (2024). Protective alleles for spaceflight. *Nature Communications* 15:6158. [doi:10.1038/s41467-024-50532-5](https://doi.org/10.1038/s41467-024-50532-5)

## Acknowledgments

- [Evo2](https://github.com/ARC-Institute/evo2) by ARC Institute
- [MaveDB](https://www.mavedb.org/) for DMS data
- Mason Lab, Weill Cornell Medicine
- Computing resources: Cayuga HPC, Cornell University
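The scoring rule from the Scoring Method section (mean log-likelihood per sequence, then `delta = score(S_alt) - score(S_ref)`) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's code: `logprob_fn` is a hypothetical callable standing in for the model and returning one log-probability per position, `toy_logprobs` is a made-up stand-in so the example runs without Evo2, and weighting the forward and reverse-complement strand scores equally is an assumption about what "reverse-complement averaging" means here.

```python
import math
from typing import Callable, List, Sequence

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def mean_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """score(S) = (1/N) * SUM of per-position log-probabilities."""
    return sum(token_logprobs) / len(token_logprobs)


def rc_averaged_score(seq: str, logprob_fn: Callable[[str], List[float]]) -> float:
    """Average the forward and reverse-complement scores (assumed equal weights)."""
    forward = mean_log_likelihood(logprob_fn(seq))
    reverse = mean_log_likelihood(logprob_fn(reverse_complement(seq)))
    return 0.5 * (forward + reverse)


def delta_score(ref_seq: str, alt_seq: str,
                logprob_fn: Callable[[str], List[float]]) -> float:
    """delta = score(S_alt) - score(S_ref); more negative = more damaging."""
    return rc_averaged_score(alt_seq, logprob_fn) - rc_averaged_score(ref_seq, logprob_fn)


def toy_logprobs(seq: str) -> List[float]:
    """Hypothetical stand-in for a model: every base gets p=0.3 except G (p=0.1)."""
    return [math.log(0.1) if base == "G" else math.log(0.3) for base in seq]


# An A>G substitution in a toy 4 bp window lowers the score, so delta is negative.
example_delta = delta_score("AAAA", "AAGA", toy_logprobs)
```

In real use `logprob_fn` would wrap the model over the 8,192 bp window described above; the sketch only shows how the two strand scores and the two alleles combine into one `delta`.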
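The per-gene ClinVar AUROC values reported above could be approximated from the `delta` and `clinvar_class` columns along the following lines. The card does not state how its AUROC was computed, so this sketch rests on assumptions: `P/LP` is treated as the positive class, `B/LB` as the negative class, all other labels are ignored, and `-delta` is used as the pathogenicity score because more negative deltas are more damaging. The function name `auroc_from_delta` is hypothetical, and the pairwise (Mann-Whitney) form is written in plain Python for clarity rather than speed.

```python
from typing import Sequence


def auroc_from_delta(deltas: Sequence[float], classes: Sequence[str]) -> float:
    """AUROC of -delta for separating P/LP (positive) from B/LB (negative).

    Equals the probability that a random pathogenic variant has a more
    negative delta than a random benign one, with ties counted as 0.5.
    """
    pos = [-d for d, c in zip(deltas, classes) if c == "P/LP"]
    neg = [-d for d, c in zip(deltas, classes) if c == "B/LB"]
    if not pos or not neg:
        raise ValueError("need at least one P/LP and one B/LB variant")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy values (not taken from the dataset); the VUS row is ignored.
toy_deltas = [-0.9, -0.05, -0.1, 0.0, -0.4]
toy_classes = ["P/LP", "P/LP", "B/LB", "B/LB", "VUS"]
toy_auroc = auroc_from_delta(toy_deltas, toy_classes)
```

With the DataFrame from the Usage section, one would pass `atm["delta"]` and `atm["clinvar_class"]` for a single gene. For the full dataset a rank-based implementation such as scikit-learn's `roc_auc_score` is the more practical choice, since the double loop is quadratic in the number of labeled variants.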
Provider:
jang1563