bedylmz/missense-variant-effects

Name: bedylmz/missense-variant-effects
Creator: bedylmz
Published: 2026-03-24 07:59:38
License: cc-by-nc-4.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/bedylmz/missense-variant-effects


Description:

---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
tags:
- biology
- medical
- genomics
- genetics
- bioinformatics
pretty_name: Genetic Variant Pathogenicity Classification
size_categories:
- 10K<n<100K
---

# Genetic Variant Pathogenicity Dataset

## Dataset Description

This dataset contains annotated genetic variants (mutations) for tabular binary classification. The objective is to predict whether a given variant is **Pathogenic** (disease-causing) or **Benign** (harmless) from bioinformatics annotations, evolutionary conservation scores, and the outputs of functional prediction tools.

- **Task:** Binary Classification
- **Target Column:** `Pathologic/Benign`

## Dataset Structure

The dataset is pre-split into `train` and `test` sets, so it can be used for modeling without a separate splitting step. The two classes are nearly balanced in both splits.

| Split | Number of Rows | Benign Count | Pathogenic Count |
|-------|----------------|--------------|------------------|
| Train | 7,856 | 3,940 | 3,916 |
| Test | 4,910 | 2,460 | 2,450 |

## Key Features

The dataset has 69 columns. The most important feature categories are:

* **Variant Identifiers:** `Chrom`, `Position`, `Ref Base`, `Alt Base`, `Gene`
* **Molecular Consequences:** `Sequence Ontology` (e.g., *missense_variant*), `cDNA change`, `Protein Change`
* **Population Frequencies:** Allele frequencies from the 1000 Genomes Project and ESP6500.
* **Functional Prediction Scores:** `CADD Exome Score`, `PolyPhen-2`, `SIFT`, `REVEL Score`
* **Conservation Scores:** `GERP++`, `PhyloP`, `SiPhy`
* **Target Label:** `Pathologic/Benign` (Values: "Benign" or "Pathogenic")

## Usage

You can load and explore this dataset with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset by its Hub repository ID (the name shown at the top of this page)
dataset = load_dataset("bedylmz/missense-variant-effects")

# View the dataset structure (splits, column names, row counts)
print(dataset)

# Convert the train split to a pandas DataFrame for easier manipulation
df_train = dataset['train'].to_pandas()

# Count the rows per class in the target column
print(df_train['Pathologic/Benign'].value_counts())
```
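The "balanced classes" claim can be checked directly from the split table. The following is a minimal sketch, not part of the original card: it hard-codes the counts from that table, and the names `SPLIT_COUNTS` and `summarize_split` are hypothetical helpers introduced here for illustration. It reports the pathogenic fraction per split and the accuracy of a majority-class baseline, which is the floor any trained classifier should beat.

```python
# Row counts copied from the split table in the dataset card.
# SPLIT_COUNTS and summarize_split are illustrative names, not part of the dataset.
SPLIT_COUNTS = {
    "train": {"Benign": 3940, "Pathogenic": 3916},
    "test": {"Benign": 2460, "Pathogenic": 2450},
}


def summarize_split(counts):
    """Return total rows, pathogenic fraction, and majority-class accuracy."""
    total = sum(counts.values())
    return {
        "total": total,
        "pathogenic_fraction": counts["Pathogenic"] / total,
        # Accuracy obtained by always predicting the most frequent class.
        "majority_baseline": max(counts.values()) / total,
    }


if __name__ == "__main__":
    for split, counts in SPLIT_COUNTS.items():
        summary = summarize_split(counts)
        print(
            f"{split}: {summary['total']} rows, "
            f"{summary['pathogenic_fraction']:.3f} pathogenic, "
            f"majority baseline {summary['majority_baseline']:.3f}"
        )
```

Both splits come out close to 0.50 on each measure, so plain accuracy is a reasonable headline metric here, and a model scoring near 50% has learned essentially nothing.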
