songlab/deprecated-human_variants

Name: songlab/deprecated-human_variants
Creator: songlab
Published: 2024-01-27 18:21:07
License: mit (per the dataset card; the listing itself gives no description)

Hugging Face · Updated 2024-01-27 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/songlab/deprecated-human_variants

Resource description:

---
license: mit
tags:
- dna
- variant-effect-prediction
- biology
- genomics
---

# Human variants

A curated set of variants from four sources: ClinVar, COSMIC, OMIM and gnomAD.

Predictions for methods benchmarked in the GPN-MSA paper can be [downloaded from here](https://huggingface.co/datasets/songlab/human_variants/resolve/main/variants_and_preds.parquet).

Functional annotations can be [downloaded from here](https://huggingface.co/datasets/songlab/human_variants/resolve/main/functional_annotations.zip).

For more information check out our [paper](https://doi.org/10.1101/2023.10.10.561776) and [repository](https://github.com/songlab-cal/gpn).

## Data sources

**ClinVar**: Missense variants considered "Pathogenic" by human labelers.

**COSMIC**: Somatic missense variants with a frequency of at least 0.1% in cancer samples (whole-genome and whole-exome sequencing only).

**OMIM**: Regulatory variants considered "Pathogenic" by human labelers, curated in [this paper](https://doi.org/10.1016/j.ajhg.2016.07.005).

**gnomAD**: All common variants (MAF > 5%) as well as an equally-sized subset of rare variants (MAC=1). Only autosomes are included.

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("songlab/human_variants", split="test")
```

Subset - ClinVar Pathogenic vs. gnomAD common (missense) (can specify `num_proc` to speed up):

```python
# Keep every ClinVar variant plus the gnomAD common missense variants.
dataset = dataset.filter(
    lambda v: v["source"] == "ClinVar"
    or (v["label"] == "Common" and "missense" in v["consequence"])
)
```

Subset - COSMIC frequent vs. gnomAD common (missense):

```python
# Keep every COSMIC variant plus the gnomAD common missense variants.
dataset = dataset.filter(
    lambda v: v["source"] == "COSMIC"
    or (v["label"] == "Common" and "missense" in v["consequence"])
)
```

Subset - OMIM Pathogenic vs. gnomAD common (regulatory):

```python
# Consequence terms treated as regulatory for this comparison.
cs = ["5_prime_UTR", "upstream_gene", "intergenic", "3_prime_UTR", "non_coding_transcript_exon"]
dataset = dataset.filter(
    lambda v: v["source"] == "OMIM"
    or (
        v["label"] == "Common"
        and "missense" not in v["consequence"]
        and any(c in v["consequence"] for c in cs)
    )
)
```

Subset - gnomAD rare vs. gnomAD common:

```python
dataset = dataset.filter(lambda v: v["source"] == "gnomAD")
```
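The filters above can also be written as named predicates, which makes them easier to reuse and to check on a few rows before running them on the full dataset. The following is a minimal sketch, not part of the original card: the helper names and the toy records are hypothetical, and the label values "Pathogenic" and "Rare" and the `*_variant` consequence strings are assumptions (the card only confirms the "Common" label and the substring checks).

```python
# Consequence terms the card treats as regulatory in the OMIM comparison.
REGULATORY_TERMS = ("5_prime_UTR", "upstream_gene", "intergenic",
                    "3_prime_UTR", "non_coding_transcript_exon")


def is_common_missense(v):
    """gnomAD common variant whose consequence mentions 'missense'."""
    return v["label"] == "Common" and "missense" in v["consequence"]


def is_common_regulatory(v):
    """gnomAD common, non-missense variant with a regulatory consequence."""
    return (
        v["label"] == "Common"
        and "missense" not in v["consequence"]
        and any(term in v["consequence"] for term in REGULATORY_TERMS)
    )


def clinvar_vs_common(v):
    """ClinVar variants plus gnomAD common missense variants."""
    return v["source"] == "ClinVar" or is_common_missense(v)


def omim_vs_common(v):
    """OMIM variants plus gnomAD common regulatory variants."""
    return v["source"] == "OMIM" or is_common_regulatory(v)


# Toy records, made up for illustration (not rows from the dataset).
toy = [
    {"source": "ClinVar", "label": "Pathogenic", "consequence": "missense_variant"},
    {"source": "gnomAD", "label": "Common", "consequence": "missense_variant"},
    {"source": "gnomAD", "label": "Common", "consequence": "5_prime_UTR_variant"},
    {"source": "gnomAD", "label": "Rare", "consequence": "intergenic_variant"},
    {"source": "OMIM", "label": "Pathogenic", "consequence": "upstream_gene_variant"},
]

clinvar_subset = [v for v in toy if clinvar_vs_common(v)]
omim_subset = [v for v in toy if omim_vs_common(v)]
print(len(clinvar_subset), len(omim_subset))

# With the real dataset, a predicate can be passed directly to filter, e.g.:
# dataset = dataset.filter(clinvar_vs_common, num_proc=4)
```

Named functions may also be preferable to lambdas when `num_proc` is set, since they are simpler to inspect when a worker process fails.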

Provider:

songlab

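Summary of the original information

Human variants dataset

Dataset overview

This dataset is a curated collection of variants from four sources: ClinVar, COSMIC, OMIM, and gnomAD. Predictions for the benchmarked methods and the functional annotations are available for download from the links in the dataset card above.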

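Data sources

ClinVar:

Missense variants labeled "Pathogenic" by human labelers.

COSMIC:

Somatic missense variants with a frequency of at least 0.1% in cancer samples (whole-genome and whole-exome sequencing only).

OMIM:

Regulatory variants labeled "Pathogenic" by human labelers, curated in the paper linked from the dataset card (https://doi.org/10.1016/j.ajhg.2016.07.005).

gnomAD:

All common variants (MAF > 5%) plus an equally sized subset of rare variants (MAC=1). Only autosomes are included.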
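To make the gnomAD thresholds concrete, here is a small sketch of how minor allele frequency (MAF) and minor allele count (MAC) relate to an alternate allele count and a total allele number. The function names, the `ac`/`an` parameters, and the "Rare" label string are assumptions for illustration; the dataset card does not describe how the authors computed these values.

```python
def minor_allele_stats(ac, an):
    """Return (MAF, MAC) given alternate allele count `ac` and allele number `an`."""
    if an <= 0 or not 0 <= ac <= an:
        raise ValueError("need an > 0 and 0 <= ac <= an")
    # The minor allele is whichever of the two alleles is less frequent.
    mac = min(ac, an - ac)
    return mac / an, mac


def frequency_class(ac, an):
    """Apply the card's thresholds: MAF > 5% is common, MAC == 1 is rare."""
    maf, mac = minor_allele_stats(ac, an)
    if maf > 0.05:
        return "Common"
    if mac == 1:
        return "Rare"  # assumed label name; the card only confirms "Common"
    return None  # neither category under these thresholds


print(frequency_class(6_000, 100_000))  # MAF = 0.06
print(frequency_class(1, 100_000))      # MAC = 1
print(frequency_class(50, 100_000))     # in between
```

Variants that fall between the two thresholds belong to neither group, which is why the last call returns `None`.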

Usage

    from datasets import load_dataset

    dataset = load_dataset("songlab/human_variants", split="test")

Subset filtering

ClinVar Pathogenic vs. gnomAD common (missense):

    dataset = dataset.filter(lambda v: v["source"]=="ClinVar" or (v["label"]=="Common" and "missense" in v["consequence"]))

COSMIC frequent vs. gnomAD common (missense):

    dataset = dataset.filter(lambda v: v["source"]=="COSMIC" or (v["label"]=="Common" and "missense" in v["consequence"]))

OMIM Pathogenic vs. gnomAD common (regulatory):

    cs = ["5_prime_UTR", "upstream_gene", "intergenic", "3_prime_UTR", "non_coding_transcript_exon"]
    dataset = dataset.filter(lambda v: v["source"]=="OMIM" or (v["label"]=="Common" and "missense" not in v["consequence"] and any([c in v["consequence"] for c in cs])))

gnomAD rare vs. gnomAD common:

    dataset = dataset.filter(lambda v: v["source"]=="gnomAD")
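After filtering, it can help to check how many variants each side of a comparison contributes. The sketch below counts records per (source, label) pair; the `summarize` helper and the toy rows are hypothetical, and the "Pathogenic" label value is an assumption. Iterating a `datasets.Dataset` yields one dict per row, so the same function should also accept the filtered `dataset`.

```python
from collections import Counter


def summarize(records):
    """Count records per (source, label) pair."""
    return Counter((r["source"], r["label"]) for r in records)


# Toy subset, made up for illustration (not rows from the dataset).
toy_subset = [
    {"source": "ClinVar", "label": "Pathogenic"},
    {"source": "gnomAD", "label": "Common"},
    {"source": "gnomAD", "label": "Common"},
]

counts = summarize(toy_subset)
for (source, label), n in sorted(counts.items()):
    print(f"{source:8s} {label:12s} {n}")
```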
