
songlab/deprecated-human_variants

Source: Hugging Face. Updated 2024-01-27; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/songlab/deprecated-human_variants
Resource description:
---
license: mit
tags:
- dna
- variant-effect-prediction
- biology
- genomics
---

# Human variants

A curated set of variants from four sources: ClinVar, COSMIC, OMIM and gnomAD.

Predictions for the methods benchmarked in the GPN-MSA paper can be [downloaded from here](https://huggingface.co/datasets/songlab/human_variants/resolve/main/variants_and_preds.parquet).

Functional annotations can be [downloaded from here](https://huggingface.co/datasets/songlab/human_variants/resolve/main/functional_annotations.zip).

For more information, check out our [paper](https://doi.org/10.1101/2023.10.10.561776) and [repository](https://github.com/songlab-cal/gpn).

## Data sources

**ClinVar**: Missense variants considered "Pathogenic" by human labelers.

**COSMIC**: Somatic missense variants with a frequency of at least 0.1% in cancer samples (whole-genome and whole-exome sequencing only).

**OMIM**: Regulatory variants considered "Pathogenic" by human labelers, curated in [this paper](https://doi.org/10.1016/j.ajhg.2016.07.005).

**gnomAD**: All common variants (MAF > 5%) as well as an equally-sized subset of rare variants (MAC=1). Only autosomes are included.

## Usage

```python
from datasets import load_dataset

# Load the dataset's "test" split
dataset = load_dataset("songlab/human_variants", split="test")
```

Subset - ClinVar Pathogenic vs. gnomAD common (missense) (can specify `num_proc` to speed up):

```python
# Keep every ClinVar variant plus the common missense variants
dataset = dataset.filter(
    lambda v: v["source"] == "ClinVar"
    or (v["label"] == "Common" and "missense" in v["consequence"])
)
```

Subset - COSMIC frequent vs. gnomAD common (missense):

```python
# Keep every COSMIC variant plus the common missense variants
dataset = dataset.filter(
    lambda v: v["source"] == "COSMIC"
    or (v["label"] == "Common" and "missense" in v["consequence"])
)
```

Subset - OMIM Pathogenic vs. gnomAD common (regulatory):

```python
# Consequence terms treated as regulatory
cs = ["5_prime_UTR", "upstream_gene", "intergenic", "3_prime_UTR", "non_coding_transcript_exon"]
# Keep every OMIM variant plus the common, non-missense regulatory variants
dataset = dataset.filter(
    lambda v: v["source"] == "OMIM"
    or (
        v["label"] == "Common"
        and "missense" not in v["consequence"]
        and any(c in v["consequence"] for c in cs)
    )
)
```

Subset - gnomAD rare vs. gnomAD common:

```python
# Keep only gnomAD variants (both rare and common)
dataset = dataset.filter(lambda v: v["source"] == "gnomAD")
```
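The filters above are written as inline lambdas. As a minimal sketch, the same conditions can be expressed as named predicates, which are easier to test in isolation and to reuse. The function names and the toy rows below are illustrative assumptions; only the column names (`source`, `label`, `consequence`) and the values used in the card's own filters come from the source.

```python
# Hypothetical helper names; the conditions mirror the card's lambdas.
REGULATORY_CONSEQUENCES = [
    "5_prime_UTR", "upstream_gene", "intergenic",
    "3_prime_UTR", "non_coding_transcript_exon",
]

def is_common_missense(v):
    """gnomAD common variant annotated with a missense consequence."""
    return v["label"] == "Common" and "missense" in v["consequence"]

def is_common_regulatory(v):
    """gnomAD common variant that is non-missense and in a regulatory class."""
    return (
        v["label"] == "Common"
        and "missense" not in v["consequence"]
        and any(c in v["consequence"] for c in REGULATORY_CONSEQUENCES)
    )

def clinvar_vs_common(v):
    return v["source"] == "ClinVar" or is_common_missense(v)

def omim_vs_common(v):
    return v["source"] == "OMIM" or is_common_regulatory(v)

# Toy rows invented for illustration; these are not real dataset records,
# and label values other than "Common" are assumptions.
toy_rows = [
    {"source": "ClinVar", "label": "Pathogenic", "consequence": "missense_variant"},
    {"source": "gnomAD", "label": "Common", "consequence": "missense_variant"},
    {"source": "gnomAD", "label": "Common", "consequence": "5_prime_UTR_variant"},
    {"source": "OMIM", "label": "Pathogenic", "consequence": "upstream_gene_variant"},
    {"source": "gnomAD", "label": "Rare", "consequence": "intron_variant"},
]

clinvar_subset = [v for v in toy_rows if clinvar_vs_common(v)]
omim_subset = [v for v in toy_rows if omim_vs_common(v)]
```

With the real dataset, a predicate like this could be passed directly to the filter, for example `dataset.filter(clinvar_vs_common, num_proc=4)`; module-level functions are also generally friendlier to multiprocessing than lambdas.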
Provider:
songlab
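Summary of original information

Human variants dataset

Dataset overview

The dataset is a curated set of variants from four sources: ClinVar, COSMIC, OMIM and gnomAD. Predictions for the benchmarked methods can be downloaded from https://huggingface.co/datasets/songlab/human_variants/resolve/main/variants_and_preds.parquet. Functional annotations can be downloaded from https://huggingface.co/datasets/songlab/human_variants/resolve/main/functional_annotations.zip.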

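Data sources

ClinVar:

  • Missense variants considered "Pathogenic" by human labelers.

COSMIC:

  • Somatic missense variants with a frequency of at least 0.1% in cancer samples (whole-genome and whole-exome sequencing only).

OMIM:

  • Regulatory variants considered "Pathogenic" by human labelers, curated in the paper at https://doi.org/10.1016/j.ajhg.2016.07.005.

gnomAD:

  • All common variants (MAF > 5%) as well as an equally-sized subset of rare variants (MAC=1). Only autosomes are included.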
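The gnomAD thresholds are stated in terms of minor allele frequency (MAF) and minor allele count (MAC). As a rough sketch, assuming each variant comes with an alternate allele count `ac` and a total allele number `an` (hypothetical inputs, not columns confirmed by the card), the two quantities and the resulting grouping could be computed as follows. The string "Rare" is an assumed label value; only "Common" appears in the card's filters.

```python
def minor_allele_count(ac, an):
    """Count of the less frequent allele among `an` observed alleles."""
    return min(ac, an - ac)

def minor_allele_frequency(ac, an):
    """Frequency of the less frequent allele."""
    return minor_allele_count(ac, an) / an

def gnomad_group(ac, an):
    """Return "Common" (MAF > 5%), "Rare" (MAC = 1), or None otherwise."""
    if minor_allele_frequency(ac, an) > 0.05:
        return "Common"
    if minor_allele_count(ac, an) == 1:
        return "Rare"
    return None  # matches neither threshold described for the gnomAD set

print(gnomad_group(12_000, 150_000))  # MAF = 0.08
print(gnomad_group(1, 150_000))       # MAC = 1
print(gnomad_group(300, 150_000))     # MAF = 0.002, MAC = 300
```

Note that the card describes the rare group as an equally-sized subset of MAC=1 variants, so meeting the MAC=1 threshold makes a variant eligible rather than guaranteeing it is included.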

Usage

```python
from datasets import load_dataset

# Load the dataset's "test" split
dataset = load_dataset("songlab/human_variants", split="test")
```

Subset filtering

  • ClinVar Pathogenic vs. gnomAD common (missense):

```python
dataset = dataset.filter(lambda v: v["source"]=="ClinVar" or (v["label"]=="Common" and "missense" in v["consequence"]))
```

  • COSMIC frequent vs. gnomAD common (missense):

```python
dataset = dataset.filter(lambda v: v["source"]=="COSMIC" or (v["label"]=="Common" and "missense" in v["consequence"]))
```

  • OMIM Pathogenic vs. gnomAD common (regulatory):

```python
cs = ["5_prime_UTR", "upstream_gene", "intergenic", "3_prime_UTR", "non_coding_transcript_exon"]
dataset = dataset.filter(lambda v: v["source"]=="OMIM" or (v["label"]=="Common" and "missense" not in v["consequence"] and any(c in v["consequence"] for c in cs)))
```

  • gnomAD rare vs. gnomAD common:

```python
dataset = dataset.filter(lambda v: v["source"]=="gnomAD")
```
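Each subset sets one group of variants against gnomAD common variants, so it can be worth checking how many rows of each kind remain after filtering. A minimal sketch using `collections.Counter` follows; `class_counts` is a hypothetical helper, the toy rows are invented for illustration, and label values other than "Common" are assumptions.

```python
from collections import Counter

def class_counts(rows):
    """Count rows per (source, label) pair."""
    return Counter((v["source"], v["label"]) for v in rows)

# Toy stand-in for a filtered subset; not real dataset records.
toy_subset = [
    {"source": "ClinVar", "label": "Pathogenic", "consequence": "missense_variant"},
    {"source": "ClinVar", "label": "Pathogenic", "consequence": "missense_variant"},
    {"source": "gnomAD", "label": "Common", "consequence": "missense_variant"},
]

counts = class_counts(toy_subset)
for (source, label), n in sorted(counts.items()):
    print(f"{source:8s} {label:12s} {n}")
```

Because iterating over a loaded `datasets.Dataset` yields one dictionary per row, the same helper should also accept the filtered `dataset` object directly, at the cost of a full pass over the data.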
