ClinVar Variant Classification Results
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/ClinVar_Variant_Classification_Results/29224097
Description:
The dataset is available in two formats: CSV and Parquet. The CSV file contains predictions from a ClinVarBERT model that classifies genetic variant submissions into three pathogenicity categories: Pathogenic/Likely Pathogenic (P/LP), Variant of Uncertain Significance (VUS), and Benign/Likely Benign (B/LB). The model predicts the category of each submission from the text of its ClinVar submission summary.
The output CSV contains the following columns:
Identifier Columns
SCV: Submission accession number from ClinVar (format: SCV000000000)
VCV: Variation accession number from ClinVar (format: VCV000000000)
RCV: Record accession number from ClinVar (format: RCV000000000)
VariationID: Numerical identifier for the genetic variation
Genomic Coordinates
GRCh38_Chr: Chromosome number
GRCh38_Start: Start position on chromosome (GRCh38/hg38 assembly)
GRCh38_Stop: Stop position on chromosome (GRCh38/hg38 assembly)
GRCh38_ReferenceAllele: Reference allele sequence
GRCh38_AlternateAllele: Alternate allele sequence
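The coordinate columns above are often combined into a single variant key when joining against other resources. The following is a minimal sketch of one way to do that, not part of the dataset or its tooling. The function name `variant_key`, the `chr-start-ref-alt` key layout, and the example row are assumptions made for illustration.

```python
def variant_key(row):
    """Build a 'chr-start-ref-alt' key from the GRCh38_* columns of one row.

    `row` is any mapping keyed by the column names described above.
    """
    chrom = str(row["GRCh38_Chr"]).strip()
    # Tolerate an optional 'chr' prefix so '17' and 'chr17' give the same key.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    # Positions may arrive as strings such as '1000' or '1000.0'.
    start = int(float(row["GRCh38_Start"]))
    return "-".join([
        chrom,
        str(start),
        row["GRCh38_ReferenceAllele"],
        row["GRCh38_AlternateAllele"],
    ])


# Made-up row for illustration only; it is not taken from the dataset.
example = {
    "GRCh38_Chr": "17",
    "GRCh38_Start": "1000",
    "GRCh38_Stop": "1000",
    "GRCh38_ReferenceAllele": "A",
    "GRCh38_AlternateAllele": "G",
}
print(variant_key(example))
```

Whether GRCh38_Chr carries a `chr` prefix in the actual file is not stated in the description, which is why the sketch accepts both forms.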
Protein-Level Information
aapos: Amino acid position in the protein
aaref: Reference amino acid (single letter code)
aaalt: Alternate amino acid (single letter code)
gene: Gene symbol (e.g., BRCA1, TP53)
Original Classification
SubmissionClassification: Original classification provided by the submitter. Most values are pathogenic, likely pathogenic, uncertain significance, benign, or likely benign, but other values also occur
ClinicalSignificance: Variant-level classification result on ClinVar
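To compare a submitter's original classification with the model output, the five main SubmissionClassification terms can be collapsed into the three model classes. The sketch below shows one such mapping. The dictionary `THREE_CLASS`, the helper names, and the decision to return `None` for any other value are assumptions, not the authors' preprocessing.

```python
# Assumed mapping from the five main submitter terms to the three model classes.
THREE_CLASS = {
    "pathogenic": "P/LP",
    "likely pathogenic": "P/LP",
    "uncertain significance": "VUS",
    "benign": "B/LB",
    "likely benign": "B/LB",
}


def collapse(submission_classification):
    """Return 'P/LP', 'VUS', 'B/LB', or None for values outside the five main terms."""
    return THREE_CLASS.get(submission_classification.strip().lower())


def agrees(row):
    """True/False if the model label matches the collapsed submitter class, None if unmapped."""
    expected = collapse(row["SubmissionClassification"])
    if expected is None:
        return None
    return expected == row["predicted_label"]


# Made-up rows for illustration only.
rows = [
    {"SubmissionClassification": "Likely pathogenic", "predicted_label": "P/LP"},
    {"SubmissionClassification": "Benign", "predicted_label": "VUS"},
    {"SubmissionClassification": "risk factor", "predicted_label": "VUS"},
]
for row in rows:
    print(row["SubmissionClassification"], agrees(row))
```

Returning `None` keeps the "other values" mentioned above out of any agreement rate instead of silently counting them as mismatches.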
Input Text
Comment: The textual comment/description provided with the variant submission that was used as input to the model
Model Predictions
prob_P_LP: Probability score for Pathogenic/Likely Pathogenic classification (0.0 to 1.0)
prob_VUS: Probability score for Variant of Uncertain Significance classification (0.0 to 1.0)
prob_B_LB: Probability score for Benign/Likely Benign classification (0.0 to 1.0)
predicted_label: Final predicted classification, taken from the class with the highest probability. Values: "P/LP", "VUS", "B/LB"
Notes on Probability Scores: All three probability scores (prob_P_LP, prob_VUS, prob_B_LB) sum to 1.0 for each row. Higher probability indicates greater model confidence for that classification. The predicted_label corresponds to the classification with the highest probability score.
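The two properties in the note above (scores sum to 1.0, and predicted_label is the highest-scoring class) can be checked row by row. This is a minimal sketch using only the Python standard library. The helper `check_row`, the tolerance of 1e-3, and the inline sample rows are assumptions; a real check would read the downloaded CSV instead of the made-up `SAMPLE_CSV` string.

```python
import csv
import io
import math

# Column name -> label it corresponds to in predicted_label.
PROB_COLUMNS = {"prob_P_LP": "P/LP", "prob_VUS": "VUS", "prob_B_LB": "B/LB"}


def check_row(row, tol=1e-3):
    """Return (sums_to_one, label_matches_argmax) for one row."""
    probs = {label: float(row[col]) for col, label in PROB_COLUMNS.items()}
    sums_to_one = math.isclose(sum(probs.values()), 1.0, abs_tol=tol)
    top_label = max(probs, key=probs.get)
    return sums_to_one, top_label == row["predicted_label"]


# Made-up rows for illustration only; the SCV values are placeholders.
SAMPLE_CSV = """SCV,prob_P_LP,prob_VUS,prob_B_LB,predicted_label
SCV000000001,0.90,0.08,0.02,P/LP
SCV000000002,0.10,0.70,0.20,VUS
"""

results = [check_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(results)
```

A small tolerance is used because probabilities written to CSV are usually rounded, so an exact equality test against 1.0 would be too strict.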
Additional Column
has_conflicting_submissions: True if the variant has conflicting submissions, meaning that different submissions classify the same variant as benign and as pathogenic.
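One reading of this flag is that a variant is conflicting when its submissions include at least one pathogenic-side and at least one benign-side classification. The sketch below implements that reading by grouping rows on VariationID. It is an interpretation of the column description, not the authors' code, and it may not reproduce the published flag exactly. The function name and the term sets are assumptions.

```python
from collections import defaultdict

# Assumed grouping of submitter terms into the two opposing sides.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
BENIGN = {"benign", "likely benign"}


def conflicting_variants(rows):
    """Return the set of VariationID values seen with both a P-side and a B-side submission."""
    sides_seen = defaultdict(set)
    for row in rows:
        term = row["SubmissionClassification"].strip().lower()
        if term in PATHOGENIC:
            sides_seen[row["VariationID"]].add("P")
        elif term in BENIGN:
            sides_seen[row["VariationID"]].add("B")
        # Uncertain significance and other values do not count toward a conflict here.
    return {vid for vid, sides in sides_seen.items() if sides == {"P", "B"}}


# Made-up rows for illustration only.
rows = [
    {"VariationID": "1", "SubmissionClassification": "Pathogenic"},
    {"VariationID": "1", "SubmissionClassification": "Likely benign"},
    {"VariationID": "2", "SubmissionClassification": "Benign"},
    {"VariationID": "2", "SubmissionClassification": "Uncertain significance"},
]
print(conflicting_variants(rows))
```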
Model Details
Model Type: Fine-tuned transformer model for sequence classification
Input: Textual comments from variant submissions
Output: Three-class classification (P/LP, VUS, B/LB)
Training Data: ClinVar variant submissions with known classifications
To learn more about how this resource was developed, please view our medRxiv preprint (https://www.medrxiv.org/content/10.1101/2024.12.31.24319792v2). In brief, we first compiled and cleaned a corpus of over one million free-text ClinVar submission summaries, removing non-evidence sentences, filtering out duplicates, and standardizing each record so that only the core evidence supporting a variant’s classification remained. We then fine-tuned a BioBERT-large model, ClinVar-BERT, using these filtered summaries to predict one of three classes: Pathogenic/Likely Pathogenic, VUS, or Benign/Likely Benign, directly from the submission comment text. Finally, to ensure biological relevance, we validated ClinVar-BERT outputs against independent deep mutational scanning datasets for genes such as BRCA1, TP53, and PTEN, and found strong concordance between the model’s three-class probabilities and orthogonal functional measurements (AUROC = 0.927).
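For readers unfamiliar with the AUROC figure quoted above, the metric can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below shows that calculation on toy numbers. It does not reproduce the preprint's evaluation: how the functional measurements were converted to labels and which probability column was scored are not described here, so the pairing of a prob_P_LP-like score with a binary functional label is an assumption.

```python
def auroc(scores, labels):
    """Rank-based AUROC: share of (positive, negative) pairs where the positive scores higher.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only: scores stand in for prob_P_LP,
# labels stand in for a binarized functional readout (1 = loss of function).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
print(auroc(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of examples, which is fine for a demonstration but too slow for large evaluations, where a sort-based implementation is the usual choice.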
Created:
2025-06-03



