goodarzilab/evo2-clinvar
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/goodarzilab/evo2-clinvar
Resource description:
---
task_categories:
- zero-shot-classification
pretty_name: ClinVar
size_categories:
- 100K<n<1M
---
# ClinVar Variant Effect Prediction Benchmark
## Dataset Description
A curated subset of the NCBI ClinVar database (release: **February 28, 2024**). Each variant includes precomputed scores from Evo 2 and a set of baseline models used in the paper. Variants were filtered to retain only those with a ClinVar final review status of **two gold stars or higher**, ensuring higher-confidence clinical annotations supported by multiple submitters or expert panels.
## Column Descriptions
### Variant Metadata
| Column | Description |
|---|---|
| `variant_id` | Concatenation of the chromosome number and ClinVar's official variant ID. |
| `variant_type` | Type of genomic variant (e.g., SNV, deletion, insertion). |
| `chrom` | Chromosome number. |
| `chrom_refseq_acc` | RefSeq accession for the chromosome. |
| `start` | Genomic start position (1-based). |
| `stop` | Genomic end position (1-based). |
| `strand` | Strand orientation (+ or −). |
| `ref_allele` | Reference allele. |
| `alt_allele` | Alternate allele. |
| `transcript_id` | Transcript affected by the variant. |
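
Because `start` and `stop` are 1-based, they usually need converting before they are used with 0-based, half-open tools such as BED files or Python slicing. The sketch below shows that conversion. It assumes `stop` is inclusive, which the table does not state explicitly, and the helper names are hypothetical rather than part of the dataset.

```python
def to_zero_based_half_open(start: int, stop: int) -> tuple[int, int]:
    """Convert a 1-based interval to a 0-based, half-open one.

    Assumption: `stop` is inclusive (not confirmed by the dataset card).
    """
    if start < 1 or stop < start:
        raise ValueError(f"invalid 1-based interval: {start}-{stop}")
    # Shift the start down by one; the inclusive stop becomes the exclusive end.
    return start - 1, stop


def span_length(start: int, stop: int) -> int:
    """Number of reference bases covered by a 1-based inclusive interval."""
    return stop - start + 1


# A single-base variant at position 100 covers the slice [99, 100).
bed_start, bed_end = to_zero_based_half_open(100, 100)
```

With this convention, `reference[bed_start:bed_end]` on a chromosome string would return the bases spanned by the variant.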
### Annotations
| Column | Description |
|---|---|
| `clinical_significance` | ClinVar-assigned clinical significance (e.g., pathogenic, benign, uncertain significance). |
| `clinsig` | Simplified clinical significance label used for model evaluation: pathogenic/likely pathogenic (P/LP) vs. benign/likely benign (B/LB). |
| `review_status` | ClinVar review status indicating the level of supporting evidence. |
| `gtf_feature` | GTF feature (CDS, 3'UTR, 5'UTR, or intergenic) encompassing the variant. |
| `splice_proximity` | Variant's proximity to the nearest transcript splice site. |
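
Checking the class balance per genomic region is a common first step before comparing models. The sketch below tallies `clinsig` labels within each `gtf_feature`. The toy rows and the literal label strings (`"P/LP"`, `"B/LB"`) are assumptions for illustration; the values actually stored in the dataset may be encoded differently.

```python
from collections import Counter, defaultdict

# Hypothetical rows; real rows would come from the dataset itself.
rows = [
    {"gtf_feature": "CDS", "clinsig": "P/LP"},
    {"gtf_feature": "CDS", "clinsig": "P/LP"},
    {"gtf_feature": "CDS", "clinsig": "B/LB"},
    {"gtf_feature": "3'UTR", "clinsig": "B/LB"},
]


def label_counts_by_feature(rows):
    """Count clinsig labels within each gtf_feature group."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["gtf_feature"]][row["clinsig"]] += 1
    # Return plain dicts so the result is easy to print or serialize.
    return {feature: dict(c) for feature, c in counts.items()}


summary = label_counts_by_feature(rows)
```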
### Model Scores
The remaining columns contain precomputed variant effect scores, one per model. These span genomic, RNA, and protein foundation models, including Evo 2 (7B and 40B), Evo, Nucleotide Transformer, CodonBERT, EnCodon, CaLM, ESM-2, ESM-1b, and RNA-FM, as well as splice effect predictors (SpliceAI, Pangolin) and other baselines such as AlphaMissense, GPN-MSA, CADD, and phyloP conservation scores (100/241/447/470-way alignments).
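
One common way to compare such score columns against the binary `clinsig` label is the area under the ROC curve (AUROC). The sketch below computes it in plain Python using the rank-sum formulation. It is an illustration, not necessarily the evaluation procedure used by the dataset authors. It assumes labels have already been mapped to 1 (P/LP) and 0 (B/LB) and that higher scores indicate pathogenicity; if a model's scores run the other way, negate them first.

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic; tied scores count as 0.5.

    scores: list of floats, higher = more likely pathogenic (assumption).
    labels: list of ints, 1 for P/LP and 0 for B/LB.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores and give each the average 1-based rank.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Toy example with made-up scores: both positives outrank both negatives.
toy_auc = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

Running the same function once per score column, optionally within each `gtf_feature` group, gives a per-model, per-region comparison.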
Provider:
goodarzilab



