
bedylmz/missense-variant-effects

Hugging Face | Updated 2026-03-24 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/bedylmz/missense-variant-effects
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
tags:
- biology
- medical
- genomics
- genetics
- bioinformatics
pretty_name: Genetic Variant Pathogenicity Classification
size_categories:
- 10K<n<100K
---

# Genetic Variant Pathogenicity Dataset

## Dataset Description

This dataset contains annotated genetic variants (mutations) for tabular binary classification. The objective is to predict whether a given variant is **Pathogenic** (disease-causing) or **Benign** (harmless) from bioinformatics annotations, evolutionary conservation scores, and the outputs of functional prediction tools.

- **Task:** Binary Classification
- **Target Column:** `Pathologic/Benign`

## Dataset Structure

The dataset is pre-split into `train` and `test` sets, so it can be used for modeling without further partitioning. The two classes are nearly balanced in both splits.

| Split | Number of Rows | Benign Count | Pathogenic Count |
|-------|----------------|--------------|------------------|
| Train | 7,856 | 3,940 | 3,916 |
| Test | 4,910 | 2,460 | 2,450 |

## Key Features

The dataset has 69 columns. The main feature categories are:

* **Variant Identifiers:** `Chrom`, `Position`, `Ref Base`, `Alt Base`, `Gene`
* **Molecular Consequences:** `Sequence Ontology` (e.g., *missense_variant*), `cDNA change`, `Protein Change`
* **Population Frequencies:** Allele frequencies from the 1000 Genomes Project and ESP6500.
* **Functional Prediction Scores:** `CADD Exome Score`, `PolyPhen-2`, `SIFT`, `REVEL Score`
* **Conservation Scores:** `GERP++`, `PhyloP`, `SiPhy`
* **Target Label:** `Pathologic/Benign` (Values: "Benign" or "Pathogenic")

## Usage

Load and explore the dataset with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset (repository ID taken from this listing's download link)
dataset = load_dataset("bedylmz/missense-variant-effects")

# View the dataset structure (splits, column names, row counts)
print(dataset)

# Convert the train split to a pandas DataFrame for easier manipulation
df_train = dataset["train"].to_pandas()

# Count how many rows carry each label in the target column
print(df_train["Pathologic/Benign"].value_counts())
```
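
The class-balance claim can be checked directly from the split counts reported in the Dataset Structure table. The following is a minimal sketch rather than part of the dataset's own tooling: the names `SPLITS`, `pathogenic_fraction`, and `check_splits` are hypothetical, and the numbers are copied from the table instead of being read from the downloaded files.

```python
# Split sizes copied from the dataset card's table (not computed from the data).
SPLITS = {
    "train": {"rows": 7856, "benign": 3940, "pathogenic": 3916},
    "test": {"rows": 4910, "benign": 2460, "pathogenic": 2450},
}


def pathogenic_fraction(split: dict) -> float:
    """Return the share of rows labelled Pathogenic in one split."""
    return split["pathogenic"] / split["rows"]


def check_splits(splits: dict) -> dict:
    """Verify that class counts add up and return the Pathogenic share per split."""
    fractions = {}
    for name, split in splits.items():
        # The two class counts should account for every row in the split.
        assert split["benign"] + split["pathogenic"] == split["rows"], name
        fractions[name] = pathogenic_fraction(split)
    return fractions


if __name__ == "__main__":
    for name, fraction in check_splits(SPLITS).items():
        print(f"{name}: {fraction:.1%} Pathogenic")
```

Both splits come out within about 0.2 percentage points of an even 50/50 division, so plain accuracy is a reasonable first metric here. The same check could be repeated on the loaded data by comparing these figures with the `value_counts()` output.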
Provider:
bedylmz