
ClinVar Variant Classification Results

Source: DataCite Commons. Updated 2025-11-07; indexed 2026-02-09.
Download link:
https://figshare.com/articles/dataset/ClinVar_Variant_Classification_Results/29224097/4
Description:
The dataset is available in two formats, CSV and Parquet. The CSV file contains predictions from ClinVar-BERT, a model that classifies genetic variant submissions into three pathogenicity categories: Pathogenic/Likely Pathogenic (P/LP), Variant of Uncertain Significance (VUS), and Benign/Likely Benign (B/LB). The model predicts each variant's category from the text of its ClinVar submission summary.

The output CSV contains the following columns.

Identifier Columns
SCV: Submission accession number from ClinVar (format: SCV000000000)
VCV: Variation accession number from ClinVar (format: VCV000000000)
RCV: Record accession number from ClinVar (format: RCV000000000)
VariationID: Numerical identifier for the genetic variation

Genomic Coordinates
GRCh38_Chr: Chromosome number
GRCh38_Start: Start position on the chromosome (GRCh38/hg38 assembly)
GRCh38_Stop: Stop position on the chromosome (GRCh38/hg38 assembly)
GRCh38_ReferenceAllele: Reference allele sequence
GRCh38_AlternateAllele: Alternate allele sequence

Protein-Level Information
aapos: Amino acid position in the protein
aaref: Reference amino acid (single-letter code)
aaalt: Alternate amino acid (single-letter code)
gene: Gene symbol (e.g., BRCA1, TP53)

Original Classification
SubmissionClassification: Original classification provided by the submitter. Most values are pathogenic, likely pathogenic, uncertain significance, benign, or likely benign, but other values also occur.
ClinicalSignificance: Variant-level classification result on ClinVar

Input Text
Comment: The free-text comment or description provided with the variant submission, used as the input to the model

Model Predictions
prob_P_LP: Probability score for the Pathogenic/Likely Pathogenic class (0.0 to 1.0)
prob_VUS: Probability score for the Variant of Uncertain Significance class (0.0 to 1.0)
prob_B_LB: Probability score for the Benign/Likely Benign class (0.0 to 1.0)
predicted_label: Final predicted class, chosen as the class with the highest probability. Values: "P/LP", "VUS", "B/LB"

Notes on Probability Scores
The three probability scores (prob_P_LP, prob_VUS, prob_B_LB) sum to 1.0 for each row. A higher probability indicates greater model confidence in that class. The predicted_label is the class with the highest probability score.

Model Summary
Model Type: Fine-tuned transformer model for sequence classification
Input: Textual comments from variant submissions
Output: Three-class classification (P/LP, VUS, B/LB)
Training Data: ClinVar variant submissions with known classifications

To learn more about how this resource was developed, please view our medRxiv preprint (https://www.medrxiv.org/content/10.1101/2024.12.31.24319792v2). In brief, we first compiled and cleaned a corpus of over one million free-text ClinVar submission summaries, removing non-evidence sentences, filtering out duplicates, and standardizing each record so that only the core evidence supporting a variant's classification remained. We then fine-tuned a BioBERT-large model, ClinVar-BERT, on these filtered summaries to predict one of three classes (Pathogenic/Likely Pathogenic, VUS, or Benign/Likely Benign) directly from the submission comment text. Finally, to ensure biological relevance, we validated ClinVar-BERT outputs against independent deep mutational scanning datasets for genes such as BRCA1, TP53, and PTEN, and found strong concordance between the model's three-class probabilities and orthogonal functional measurements (AUROC = 0.927).
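
The column descriptions above imply two properties that are easy to check once the CSV is downloaded: the three probability columns should sum to 1.0, and predicted_label should match the highest-probability class. The following is a minimal sketch of those checks using only the Python standard library. The sample rows, the helper names (load_rows, argmax_label, check_row, agreement_counts), the tolerance, and the mapping from submitter terms to the three classes are all illustrative assumptions, not part of the dataset or the authors' code.

```python
import csv
import io
import math

# Probability column -> class label, following the column descriptions.
PROB_TO_LABEL = {"prob_P_LP": "P/LP", "prob_VUS": "VUS", "prob_B_LB": "B/LB"}

# Assumed mapping from common submitter terms to the three model classes.
# The real SubmissionClassification column also contains other values.
SUBMITTER_TO_LABEL = {
    "pathogenic": "P/LP",
    "likely pathogenic": "P/LP",
    "uncertain significance": "VUS",
    "benign": "B/LB",
    "likely benign": "B/LB",
}

# Made-up rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """\
SCV,gene,SubmissionClassification,prob_P_LP,prob_VUS,prob_B_LB,predicted_label
SCV000000001,BRCA1,Pathogenic,0.90,0.08,0.02,P/LP
SCV000000002,TP53,Uncertain significance,0.15,0.70,0.15,VUS
SCV000000003,PTEN,Likely benign,0.05,0.35,0.60,B/LB
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def argmax_label(row):
    """Return the class label whose probability column is largest."""
    best = max(PROB_TO_LABEL, key=lambda col: float(row[col]))
    return PROB_TO_LABEL[best]


def check_row(row, tol=1e-6):
    """Return a list of consistency problems found in one row."""
    problems = []
    total = sum(float(row[col]) for col in PROB_TO_LABEL)
    if not math.isclose(total, 1.0, abs_tol=tol):
        problems.append(f"probabilities sum to {total:.6f}, not 1.0")
    if row["predicted_label"] != argmax_label(row):
        problems.append("predicted_label is not the highest-probability class")
    return problems


def agreement_counts(rows):
    """Count rows where the prediction matches the mapped submitter class."""
    counts = {"agree": 0, "disagree": 0, "unmapped": 0}
    for row in rows:
        key = row["SubmissionClassification"].strip().lower()
        expected = SUBMITTER_TO_LABEL.get(key)
        if expected is None:
            counts["unmapped"] += 1
        elif expected == row["predicted_label"]:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    for row in rows:
        print(row["SCV"], check_row(row) or "ok")
    print(agreement_counts(rows))
```

To run the same checks on the real file, the sample text could be replaced with the contents of the downloaded CSV. If the published probabilities are rounded, the tolerance passed to check_row may need to be loosened.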
Provider:
figshare
Created:
2025-11-04