DYNA: Disease-Specific Language Model for Variant Pathogenicity

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/12116073

Resource description:

For coding variant effect predictions (VEPs), our approach centers on clinical variant sets related to inherited cardiomyopathies (CM) and arrhythmias (ARM). We use a pre-compiled dataset of rare missense pathogenic and benign variants, categorized using a cohort-based approach for cardiomyopathy and arrhythmia, as detailed in the previous report by Zhang et al. The ClinVar CM and ARM datasets, which include all missense variants in CM and ARM, respectively, are extracted from ClinVar (Landrum et al.). For non-coding VEPs, our focus shifts to splicing-related variants, using a dataset from the multiplexed assay for exon recognition by Chong et al., which highlights the significant impact of rare genetic variants on splicing disruption. Similarly, the ClinVar Splicing dataset, compiled from ClinVar, encompasses all benign sequences and pathogenic variants pertinent to splicing. For the ClinVar CM and ARM datasets, we translate the DNA sequences into protein sequences using the human genome assembly hg38 from https://www.ncbi.nlm.nih.gov/grc/human. We use the GFF file MANE.GRCh38.v1.1.ensembl_genomic.gff.gz from https://www.ncbi.nlm.nih.gov/refseq/MANE to annotate coding versus non-coding regions for each gene, because only coding DNA sequences are translated into proteins. Additionally, protein domains, cataloged in the Pfam database (Finn et al.), are essential for the functional characterization of proteins. These domains are identified by aligning the translated sequences to known domain structures, which gives deeper insight into protein function.
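The translation step described above (select the CDS features from a GFF file, splice them together, and translate the result) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `cds_intervals` and `translate_cds`, the toy genome, and the `Parent=tx1` attribute value are all hypothetical, and the attribute layout of the real MANE GFF file should be checked before reusing the filter. The sketch also assumes the first CDS begins in frame (phase 0) and uses the standard genetic code.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def cds_intervals(gff_lines, parent_id):
    """Collect (seqid, start, end, strand) for CDS rows of one transcript.

    Assumes GFF3 with nine tab-separated columns and a 'Parent' attribute
    equal to parent_id (a hypothetical identifier in this sketch).
    """
    intervals = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if attrs.get("Parent") != parent_id:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        intervals.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return sorted(intervals, key=lambda iv: iv[1])


def translate_cds(genome, intervals):
    """Splice CDS intervals from genome (dict: seqid -> sequence) and translate."""
    if not intervals:
        return ""
    strand = intervals[0][3]
    dna = "".join(genome[s][start - 1:end] for s, start, end, _ in intervals).upper()
    if strand == "-":
        # Minus-strand genes are read as the reverse complement.
        dna = dna.translate(COMPLEMENT)[::-1]
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Toy example: two coding exons separated by a six-base intron.
toy_genome = {"chr1": "GGATGGCCGTAAGTAAATAGCC"}
toy_gff = [
    "##gff-version 3",
    "chr1\ttoy\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\ttoy\tCDS\t15\t20\t.\t+\t0\tID=cds1;Parent=tx1",
]
toy_protein = translate_cds(toy_genome, cds_intervals(toy_gff, "tx1"))
print(toy_protein)
```

Filtering on CDS rows rather than exon rows is what keeps untranslated regions out of the protein sequence, which matches the coding versus non-coding distinction drawn in the description.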

Created:

2024-08-28
