jang1563/evo2-spaceflight-vep
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/jang1563/evo2-spaceflight-vep
Resource description:
---
license: mit
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- biology
- genomics
- variant-effect-prediction
- evo2
- spaceflight
- radiation-biology
- ClinVar
- deep-mutational-scanning
- non-coding-variants
pretty_name: "Evo2 Spaceflight Gene Variant Effect Scores"
size_categories:
- 100K<n<1M
---
# Evo2 Zero-Shot VEP Scores for Spaceflight Radiation-Response Genes
Pre-computed zero-shot variant effect prediction scores from the [Evo2](https://github.com/ARC-Institute/evo2) genomic foundation model (7B parameters) for **215,001 variants** across 10 spaceflight radiation-response genes.
**Code:** [github.com/jang1563/evo2-spaceflight-vep](https://github.com/jang1563/evo2-spaceflight-vep)
## Dataset Description
Each row is a single variant (SNV or indel) scored by Evo2 using an 8,192 bp context window with reverse-complement averaging.
### Columns
| Column | Type | Description |
|--------|------|-------------|
| `variant_key` | string | `chrom:pos:ref>alt` identifier |
| `gene` | string | Gene symbol (BRCA1, TP53, CHEK2, DNMT3A, TERT, ATM, NFE2L2, CLOCK, MSTN, RAD51) |
| `chrom` | string | Chromosome (GRCh38) |
| `pos` | int | 0-based genomic position |
| `ref` | string | Reference allele |
| `alt` | string | Alternate allele |
| `region_type` | string | Functional region: `coding`, `intronic`, `promoter`, `utr5`, `utr3`, `intergenic`, or `radiation_del_mh1`--`radiation_del_mh4` (microhomology-flanked deletion regions) |
| `delta` | float | Evo2 delta score: `score(alt) - score(ref)`. **More negative = more damaging.** |
| `ref_score` | float | Evo2 mean log-likelihood for reference sequence |
| `alt_score` | float | Evo2 mean log-likelihood for alternate sequence |
| `clinvar_class` | string | ClinVar classification (`P/LP`, `B/LB`, `VUS`, `Conflicting_classifications_of_pathogenicity`, etc.) or empty |
| `clinvar_stars` | int | ClinVar review star level (0--4) or empty |
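
The `variant_key` format can be split back into its fields with a few lines of Python. The sketch below is illustrative only: `Variant` and `parse_variant_key` are hypothetical helper names, the example key is made up rather than taken from the dataset, and the parser assumes that neither allele contains a `:` or `>` character. It also assumes the position inside the key follows the same 0-based convention as the `pos` column.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # kept exactly as written in the key
    ref: str
    alt: str

    @property
    def kind(self) -> str:
        # A single-base substitution is an SNV; anything else is treated as an indel.
        return "SNV" if len(self.ref) == 1 and len(self.alt) == 1 else "indel"


def parse_variant_key(key: str) -> Variant:
    """Split a `chrom:pos:ref>alt` key into its four fields."""
    chrom, pos, change = key.split(":")
    ref, alt = change.split(">")
    return Variant(chrom, int(pos), ref, alt)


# Made-up example key, not a row from the dataset.
v = parse_variant_key("chr17:7674220:C>T")
print(v, v.kind)
# The `pos` column is documented as 0-based; add 1 for 1-based (VCF-style) coordinates.
print("1-based position:", v.pos + 1)
```

The 1-based conversion matters when joining these scores against VCF-style resources, which count positions from 1.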
### Gene Summary
| Gene | Variants | ClinVar AUROC | Key Role |
|------|----------|---------------|----------|
| ATM | 52,791 | 0.995 | Double-strand break (DSB) sensor for radiation damage |
| BRCA1 | 36,901 | 0.988 | DNA double-strand break repair |
| DNMT3A | 19,693 | 0.996 | DNA methyltransferase; clonal hematopoiesis (CH) in astronauts |
| CLOCK | 18,369 | -- | Circadian rhythm disruption in space |
| TERT | 17,443 | 0.909 | Telomere biology; NASA Twins Study |
| CHEK2 | 16,098 | 1.000 | DNA damage checkpoint kinase |
| NFE2L2 | 15,230 | -- | Antioxidant response to radiation |
| TP53 | 14,240 | 0.989 | Tumor suppressor; clonal hematopoiesis (CH) in astronauts |
| MSTN | 12,383 | -- | Muscle wasting in microgravity |
| RAD51 | 11,853 | -- | Homologous recombination repair |
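
As a quick consistency check, the per-gene AUROC values in this table can be averaged and compared with the mean of 0.980 reported under Validation. The snippet assumes an unweighted mean over the six genes that have an AUROC; rows marked `--` have no reported value and are left out.

```python
# AUROC values copied from the Gene Summary table (genes with a reported AUROC only).
auroc = {
    "ATM": 0.995,
    "BRCA1": 0.988,
    "DNMT3A": 0.996,
    "TERT": 0.909,
    "CHEK2": 1.000,
    "TP53": 0.989,
}

# Unweighted mean across genes (an assumption about how the summary figure was formed).
mean_auroc = sum(auroc.values()) / len(auroc)
lowest = min(auroc, key=auroc.get)

print(f"mean AUROC over {len(auroc)} genes: {mean_auroc:.4f}")
print(f"lowest: {lowest} ({auroc[lowest]:.3f})")
```

The rounded table values average to roughly 0.9795, which agrees with the reported 0.980 to three decimals.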
## Usage
```python
from datasets import load_dataset

# Load the dataset and convert it to a pandas DataFrame (requires pandas)
ds = load_dataset("jang1563/evo2-spaceflight-vep", split="train")
df = ds.to_pandas()

# Filter to a specific gene
atm = df[df["gene"] == "ATM"]

# Most damaging variants (most negative delta)
damaging = df.nsmallest(100, "delta")

# ClinVar pathogenic / likely pathogenic variants
pathogenic = df[df["clinvar_class"] == "P/LP"]

# Radiation-type mutations (C>A / G>T, characteristic of 8-oxoguanine)
radiation = df[
    ((df["ref"] == "C") & (df["alt"] == "A"))
    | ((df["ref"] == "G") & (df["alt"] == "T"))
]
```
## Scoring Method
```
score(S) = (1/N) * SUM log P(s_{t+1} | s_1, ..., s_t)
delta = score(S_alt) - score(S_ref)
```
- Model: Evo2 7B (`evo2_7b`, bfloat16)
- Window: 8,192 bp (selected as optimal in an ablation study)
- Reverse-complement averaging enabled
- Reference genome: GRCh38
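
The delta computation above can be sketched end to end with a placeholder scorer. This is not the Evo2 API: `TOY_PROBS` is a hypothetical context-free stand-in for the model, all function names are illustrative, and averaging the forward-strand and reverse-complement scores with equal weight is an assumption about how reverse-complement averaging is applied.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical stand-in for the language model: fixed per-base probabilities.
# Evo2 conditions each base on the preceding context; this toy model does not.
TOY_PROBS = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def mean_log_likelihood(seq: str) -> float:
    """score(S): average per-base log-probability under the toy model."""
    return sum(math.log(TOY_PROBS[base]) for base in seq) / len(seq)


def rc_averaged_score(seq: str) -> float:
    """Average the forward-strand and reverse-complement scores (assumed equal weights)."""
    return 0.5 * (mean_log_likelihood(seq) + mean_log_likelihood(reverse_complement(seq)))


def apply_variant(window: str, offset: int, ref: str, alt: str) -> str:
    """Return the window with `ref` at `offset` replaced by `alt`."""
    if window[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the window")
    return window[:offset] + alt + window[offset + len(ref):]


def delta_score(window: str, offset: int, ref: str, alt: str) -> float:
    """delta = score(S_alt) - score(S_ref); more negative means more damaging."""
    alt_window = apply_variant(window, offset, ref, alt)
    return rc_averaged_score(alt_window) - rc_averaged_score(window)


window = "ACGTTGCAAC"  # stands in for the 8,192 bp context window
print(delta_score(window, 3, "T", "G"))
```

In a real run, `mean_log_likelihood` would be replaced by a call into the Evo2 model; the surrounding structure (build the alternate window, score both strands, subtract) is the part this sketch is meant to illustrate.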
## Validation
- **Mean AUROC 0.980** across 6 ClinVar-validatable genes (vs CADD 0.997, AlphaMissense 0.969)
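- Evo2 achieves the **highest AUROC** among the compared predictors for CHEK2 (0.9996, shown rounded to 1.000 in the Gene Summary) and TP53 (0.989)
- Deep mutational scanning (DMS) calibration: Spearman |rho| = 0.23--0.52 across 4 control genes (TP53 rho is negative due to the directionality of the loss-of-function (LOF) assay)
- Non-coding: rho = -0.267 (p = 3.8e-14) against TERT massively parallel reporter assay (MPRA) data
- Radiation-type mutations (C>A / G>T) score as more damaging in all 10 genes (all p < 0.005)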
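
For readers who want to reproduce a ClinVar AUROC from the `delta` and `clinvar_class` columns, the following is a minimal sketch using the pairwise definition of AUROC. The rows are hypothetical values, not dataset entries, and whether the reported figures were filtered by review stars or region type is not stated here, so this should not be read as the exact evaluation protocol.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical (delta, clinvar_class) pairs, not rows from the dataset.
rows = [
    (-0.012, "P/LP"),
    (-0.009, "P/LP"),
    (-0.002, "P/LP"),
    (-0.004, "B/LB"),
    (-0.001, "B/LB"),
    (0.000, "B/LB"),
    (-0.006, "VUS"),  # neither pathogenic nor benign, ignored below
]

# Negate delta so that a higher score means "more damaging".
pathogenic = [-d for d, c in rows if c == "P/LP"]
benign = [-d for d, c in rows if c == "B/LB"]

print(f"AUROC = {auroc(pathogenic, benign):.3f}")
```

The pairwise loop is quadratic and only suitable for illustration; on the full dataset a rank-based implementation such as scikit-learn's `roc_auc_score` is the more practical choice.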
## Intended Use
These scores provide **computational supporting evidence** (PP3/BP4 level under ACMG/AMP guidelines) for variant prioritization. They are **NOT standalone diagnostic classifications**. Use in conjunction with clinical evaluation, functional data, and other tools.
### Limitations
- Scores reflect evolutionary sequence constraint, not direct functional measurement
- Non-coding predictions are exploratory for genes without MPRA validation
- Scores come from the 7B model only (the 40B model requires >48 GB VRAM)
- ClinVar has known European ancestry overrepresentation
## Citation
```bibtex
@software{kim2026evo2vep,
title={Evo2 Zero-Shot Variant Effect Prediction for Spaceflight Genes},
author={Kim, JangKeun and Mason, Christopher E.},
year={2026},
url={https://github.com/jang1563/evo2-spaceflight-vep}
}
```
## References
- Nguyen et al. (2026). Sequence modeling and design from molecular to genome scale with Evo 2. *Nature*. [doi:10.1038/s41586-026-10176-5](https://doi.org/10.1038/s41586-026-10176-5)
- Rutter et al. (2024). Protective alleles for spaceflight. *Nature Communications* 15:6158. [doi:10.1038/s41467-024-50532-5](https://doi.org/10.1038/s41467-024-50532-5)
## Acknowledgments
- [Evo2](https://github.com/ARC-Institute/evo2) by Arc Institute
- [MaveDB](https://www.mavedb.org/) for DMS data
- Mason Lab, Weill Cornell Medicine
- Computing resources: Cayuga HPC, Cornell University
Provider:
jang1563



