
goodarzilab/evo2-clinvar

Source: Hugging Face | Updated: 2026-03-20 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/goodarzilab/evo2-clinvar
Resource description:
---
task_categories:
- zero-shot-classification
pretty_name: ClinVar
size_categories:
- 100K<n<1M
---

# ClinVar Variant Effect Prediction Benchmark

## Dataset Description

A curated subset of the NCBI ClinVar database (release: **February 28, 2024**). Each variant includes precomputed scores from Evo 2 and a set of baseline models used in the paper.

Variants were filtered to retain only those with a ClinVar final review status of **two gold stars or higher**, which restricts the benchmark to higher-confidence clinical annotations supported by multiple submitters or expert panels.

## Column Descriptions

### Variant Metadata

| Column | Description |
|---|---|
| `variant_id` | Concatenation of the chromosome number and ClinVar's official variant ID. |
| `variant_type` | Type of genomic variant (e.g., SNV, deletion, insertion). |
| `chrom` | Chromosome number. |
| `chrom_refseq_acc` | RefSeq accession for the chromosome. |
| `start` | Genomic start position (1-based). |
| `stop` | Genomic end position (1-based). |
| `strand` | Strand orientation (+ or −). |
| `ref_allele` | Reference allele. |
| `alt_allele` | Alternate allele. |
| `transcript_id` | Transcript affected by the variant. |

### Annotations

| Column | Description |
|---|---|
| `clinical_significance` | ClinVar-assigned clinical significance (e.g., pathogenic, benign, uncertain significance). |
| `clinsig` | Simplified clinical significance label used for model evaluation (P/LP vs. B/LB). |
| `review_status` | ClinVar review status indicating the level of supporting evidence. |
| `gtf_feature` | GTF feature (CDS, 3'UTR, 5'UTR, or intergenic) encompassing the variant. |
| `splice_proximity` | Variant's proximity to the nearest transcript splice site. |

### Model Scores

The remaining columns contain precomputed variant effect scores, one per model. These span:

- genomic, RNA, and protein foundation models: Evo 2 (7B and 40B), Evo, Nucleotide Transformer, CodonBERT, EnCodon, CaLM, ESM-2, ESM-1b, and RNA-FM;
- splice effect predictors: SpliceAI and Pangolin;
- other baselines: AlphaMissense, GPN-MSA, CADD, and phyloP conservation scores (100/241/447/470-way alignments).
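Because each score column is meant to be compared against the binary `clinsig` label, a typical use of the table is a per-region AUROC. The sketch below is one minimal way to do that in plain Python, not the evaluation code from the paper. It assumes that rows are available as dictionaries, that the positive label is stored as the string `"P/LP"`, and that higher scores mean "more likely pathogenic". None of these is confirmed by the card, and the score column name `toy_score` is hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U statistic); tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auroc_by_feature(rows, score_col, label_col="clinsig",
                     group_col="gtf_feature", positive="P/LP"):
    """AUROC of one score column within each gtf_feature group.

    `positive` is an ASSUMED encoding of the pathogenic class; check the
    actual values in the `clinsig` column before relying on it.
    """
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        score = row.get(score_col)
        if score is None:  # some models may not score every variant
            continue
        labels, scores = groups[row[group_col]]
        labels.append(1 if row[label_col] == positive else 0)
        scores.append(float(score))
    # Skip groups that contain only one class, where AUROC is undefined.
    return {g: auroc(l, s) for g, (l, s) in groups.items() if 0 < sum(l) < len(l)}


# Toy rows with a hypothetical score column, for illustration only.
rows = [
    {"gtf_feature": "CDS", "clinsig": "P/LP", "toy_score": 0.9},
    {"gtf_feature": "CDS", "clinsig": "B/LB", "toy_score": 0.1},
    {"gtf_feature": "CDS", "clinsig": "P/LP", "toy_score": 0.4},
    {"gtf_feature": "CDS", "clinsig": "B/LB", "toy_score": 0.6},
    {"gtf_feature": "3'UTR", "clinsig": "P/LP", "toy_score": 0.8},
    {"gtf_feature": "3'UTR", "clinsig": "B/LB", "toy_score": 0.2},
]
result = auroc_by_feature(rows, "toy_score")
print(result)
```

With real data, the rows could come from the Hugging Face `datasets` library, for example via `load_dataset("goodarzilab/evo2-clinvar")`. The available split names and the exact score column names are not listed in the card, so inspect them first. For likelihood-based scores where lower values indicate a more deleterious variant, negate the column before calling `auroc_by_feature`.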
Provider:
goodarzilab