bedylmz/missense-variant-effects
Hugging Face, updated 2026-03-24, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/bedylmz/missense-variant-effects
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
tags:
- biology
- medical
- genomics
- genetics
- bioinformatics
pretty_name: Genetic Variant Pathogenicity Classification
size_categories:
- 10K<n<100K
---
# Genetic Variant Pathogenicity Dataset
## Dataset Description
This dataset contains annotated genetic variants (mutations) prepared for tabular binary classification. The objective is to predict whether a given variant is **Pathogenic** (disease-causing) or **Benign** (harmless) from bioinformatics annotations, evolutionary conservation scores, and scores from functional prediction tools.
- **Task:** Binary Classification
- **Target Column:** `Pathologic/Benign`
## Dataset Structure
The dataset is pre-split into `train` and `test` sets, so it can be used for modeling without further partitioning. The two classes are almost evenly balanced in both splits, at roughly 50% each.
| Split | Number of Rows | Benign Count | Pathogenic Count |
|-------|----------------|--------------|------------------|
| Train | 7,856 | 3,940 | 3,916 |
| Test | 4,910 | 2,460 | 2,450 |
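The counts in the table can be checked with a few lines of arithmetic. The sketch below hard-codes the numbers from the table rather than loading the dataset, and computes each split's total, its benign fraction, and the share of rows held out for testing. The names `splits` and `split_summary` are illustrative only.

```python
# Row counts copied from the table above; nothing is loaded from the Hub.
splits = {
    "train": {"Benign": 3940, "Pathogenic": 3916},
    "test": {"Benign": 2460, "Pathogenic": 2450},
}


def split_summary(counts):
    """Return (total rows, fraction of rows labelled Benign) for one split."""
    total = sum(counts.values())
    return total, counts["Benign"] / total


summary = {name: split_summary(counts) for name, counts in splits.items()}
for name, (total, benign_frac) in summary.items():
    print(f"{name}: {total} rows, {benign_frac:.1%} benign")

# Share of all rows that belong to the test split
total_rows = sum(total for total, _ in summary.values())
test_share = summary["test"][0] / total_rows
print(f"total: {total_rows} rows, test share: {test_share:.1%}")
```

The per-class counts add up to the stated row totals, both splits sit close to a 50/50 class ratio, and the test split holds a little under 40% of all rows.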
## Key Features
The dataset has 69 columns. The main feature categories are:
* **Variant Identifiers:** `Chrom`, `Position`, `Ref Base`, `Alt Base`, `Gene`
* **Molecular Consequences:** `Sequence Ontology` (e.g., *missense_variant*), `cDNA change`, `Protein Change`
* **Population Frequencies:** Allele frequencies from the 1000 Genomes Project and ESP6500.
* **Functional Prediction Scores:** `CADD Exome Score`, `PolyPhen-2`, `SIFT`, `REVEL Score`
* **Conservation Scores:** `GERP++`, `PhyloP`, `SiPhy`
* **Target Label:** `Pathologic/Benign` (Values: "Benign" or "Pathogenic")
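Most classifiers expect a numeric target, so the string labels usually need to be encoded first. The following is a minimal sketch of one way to do that, assuming the label strings are exactly "Benign" and "Pathogenic" as listed above. The integer coding (1 for Pathogenic), the helper name `encode_label`, and the example rows are assumptions made for illustration and do not come from the dataset.

```python
LABEL_COLUMN = "Pathologic/Benign"
# Assumed coding: Pathogenic is treated as the positive class.
LABEL_TO_INT = {"Benign": 0, "Pathogenic": 1}


def encode_label(row):
    """Return a dict with an integer `label` derived from the target column."""
    value = row[LABEL_COLUMN]
    if value not in LABEL_TO_INT:
        # Fail loudly rather than silently mislabel an unexpected value.
        raise ValueError(f"Unexpected label: {value!r}")
    return {"label": LABEL_TO_INT[value]}


# Hypothetical rows for illustration only; the values are made up.
rows = [
    {"Gene": "GENE_A", "REVEL Score": 0.91, "Pathologic/Benign": "Pathogenic"},
    {"Gene": "GENE_B", "REVEL Score": 0.07, "Pathologic/Benign": "Benign"},
]
labels = [encode_label(row)["label"] for row in rows]
print(labels)
```

Because the function takes one row as a dict and returns a dict, it should also be usable with the `map` method of a `datasets` split, for example `dataset["train"].map(encode_label)`, which would add a `label` column alongside the original one.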
## Usage
You can easily load and explore this dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("bedylmz/missense-variant-effects")

# View the dataset structure (splits, columns, row counts)
print(dataset)

# Convert the train split to a pandas DataFrame for easy manipulation
df_train = dataset["train"].to_pandas()

# Count the rows in each class of the target column
print(df_train["Pathologic/Benign"].value_counts())
```
Provider:
bedylmz



