songlab/deprecated-full-gnomad

Hugging Face, updated 2024-01-27, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/songlab/deprecated-full-gnomad
Resource description:
---
license: mit
tags:
- dna
- variant-effect-prediction
- biology
- genomics
---

# gnomAD variants and GPN-MSA predictions

For more information check out our [paper](https://doi.org/10.1101/2023.10.10.561776) and [repository](https://github.com/songlab-cal/gpn).

## Querying specific variants or genes

- Install the latest [tabix](https://www.htslib.org/doc/tabix.html):

In your current conda environment (might be slow):

```bash
conda install -c bioconda -c conda-forge htslib=1.18
```

or in a new conda environment:

```bash
conda create -n tabix -c bioconda -c conda-forge htslib=1.18
conda activate tabix
```

- Query a specific region (e.g. BRCA1) from the remote file:

```bash
tabix https://huggingface.co/datasets/songlab/gnomad/resolve/main/scores.tsv.bgz 17:43,044,295-43,125,364
```

The output has the following columns:

| chrom | pos | ref | alt | GPN-MSA score |

and would start like this:

```tsv
17	43044304	T	G	-5.10
17	43044309	A	G	-3.27
17	43044315	T	A	-6.84
17	43044320	T	C	-6.19
17	43044322	G	T	-5.29
17	43044326	T	G	-3.22
17	43044342	T	C	-4.10
17	43044346	C	T	-2.06
17	43044351	C	T	-0.33
17	43044352	G	A	2.05
```

- If you want to run many queries, you might want to first download the files locally:

```bash
wget https://huggingface.co/datasets/songlab/gnomad/resolve/main/scores.tsv.bgz
wget https://huggingface.co/datasets/songlab/gnomad/resolve/main/scores.tsv.bgz.tbi
```

and then query:

```bash
tabix scores.tsv.bgz 17:43,044,295-43,125,364
```

## Large-scale analysis

`test.parquet` contains coordinates, scores, plus allele frequency and consequences.

Download:

```bash
wget https://huggingface.co/datasets/songlab/gnomad/resolve/main/test.parquet
```

Load into a Pandas dataframe:

```python
import pandas as pd

df = pd.read_parquet("test.parquet")
```
Provider:
songlab
Summary of original information

gnomAD variants and GPN-MSA predictions

Dataset overview

This dataset contains gnomAD variants together with their GPN-MSA predictions.

Querying the data

Installing tabix

  • In the current conda environment: `conda install -c bioconda -c conda-forge htslib=1.18`

  • In a new conda environment: `conda create -n tabix -c bioconda -c conda-forge htslib=1.18`, followed by `conda activate tabix`

Querying a specific region

  • Query a specific region (e.g. BRCA1) from the remote file: `tabix https://huggingface.co/datasets/songlab/gnomad/resolve/main/scores.tsv.bgz 17:43,044,295-43,125,364`

    The output has the following columns: | chrom | pos | ref | alt | GPN-MSA score |

    Example output:

      17 43044304 T G -5.10
      17 43044309 A G -3.27
      17 43044315 T A -6.84
      17 43044320 T C -6.19
      17 43044322 G T -5.29
      17 43044326 T G -3.22
      17 43044342 T C -4.10
      17 43044346 C T -2.06
      17 43044351 C T -0.33
      17 43044352 G A 2.05
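
The five-column output is straightforward to parse in Python. The following is a minimal sketch using only the standard library. `ScoredVariant` and `parse_tabix_output` are illustrative names, not part of the dataset or the GPN repository, and the code assumes every line holds the five whitespace- or tab-separated fields in the order listed. The sample rows are the ten rows of the example output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredVariant:
    """One output row: chrom, pos, ref, alt, GPN-MSA score (hypothetical record type)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    score: float


def parse_tabix_output(text):
    """Parse tabix output text into a list of ScoredVariant records."""
    variants = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        chrom, pos, ref, alt, score = fields
        variants.append(ScoredVariant(chrom, int(pos), ref, alt, float(score)))
    return variants


# Rows copied from the example output of the BRCA1 query.
SAMPLE = """\
17\t43044304\tT\tG\t-5.10
17\t43044309\tA\tG\t-3.27
17\t43044315\tT\tA\t-6.84
17\t43044320\tT\tC\t-6.19
17\t43044322\tG\tT\t-5.29
17\t43044326\tT\tG\t-3.22
17\t43044342\tT\tC\t-4.10
17\t43044346\tC\tT\t-2.06
17\t43044351\tC\tT\t-0.33
17\t43044352\tG\tA\t2.05
"""

variants = parse_tabix_output(SAMPLE)
# Pick the row with the numerically lowest score in the sample.
lowest = min(variants, key=lambda v: v.score)
print(len(variants), lowest)
```

In practice the text passed to `parse_tabix_output` would be the captured standard output of a `tabix` call rather than an inline string.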

Local queries

  • Download the files locally: `wget https://huggingface.co/datasets/songlab/gnomad/resolve/main/scores.tsv.bgz` and `wget https://huggingface.co/datasets/songlab/gnomad/resolve/main/scores.tsv.bgz.tbi`

  • Query a specific region from the local file: `tabix scores.tsv.bgz 17:43,044,295-43,125,364`
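
When many regions need to be queried, it can help to drive `tabix` from a script. The sketch below is one possible way to do that and is not taken from the dataset authors. `format_region`, `tabix_command`, and `run_tabix` are hypothetical helper names. It assumes `tabix` is on `PATH` and relies on tabix region strings being 1-based and inclusive; the thousands separators used in the commands above are optional and are omitted here.

```python
import shutil
import subprocess


def format_region(chrom, start, end):
    """Build a tabix region string such as '17:43044295-43125364'."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{chrom}:{start}-{end}"


def tabix_command(source, chrom, start, end):
    """Return the argument list for querying `source` (a local path or a URL)."""
    return ["tabix", source, format_region(chrom, start, end)]


def run_tabix(source, chrom, start, end):
    """Run tabix and return its standard output as text."""
    if shutil.which("tabix") is None:
        raise RuntimeError("tabix not found on PATH; install htslib first")
    result = subprocess.run(
        tabix_command(source, chrom, start, end),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Build (but do not run) the command for the BRCA1 region used above.
cmd = tabix_command("scores.tsv.bgz", "17", 43044295, 43125364)
print(" ".join(cmd))
```

Calling `run_tabix("scores.tsv.bgz", "17", 43044295, 43125364)` would execute the same query, provided `scores.tsv.bgz` and its `.tbi` index sit in the working directory.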

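Large-scale analysis

The test.parquet file contains coordinates, scores, allele frequencies, and consequences.

  • Download the file: `wget https://huggingface.co/datasets/songlab/gnomad/resolve/main/test.parquet`

  • Load it into a Pandas dataframe: `df = pd.read_parquet("test.parquet")` (after `import pandas as pd`)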
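
Once the dataframe is loaded, a typical first step is to rank variants by score. The source does not list the column names in `test.parquet`, so the sketch below works on a small in-memory dataframe whose columns (`chrom`, `pos`, `ref`, `alt`, `score`) are assumptions made for illustration. `lowest_scoring` is a hypothetical helper, and the toy rows come from the example tabix output.

```python
import pandas as pd


def lowest_scoring(df, score_col, n=3):
    """Return the n rows with the smallest values in score_col."""
    if score_col not in df.columns:
        raise KeyError(f"column {score_col!r} not found; available: {list(df.columns)}")
    return df.nsmallest(n, score_col)


# Toy stand-in for the real dataframe; the column names are assumed, not confirmed.
toy = pd.DataFrame(
    {
        "chrom": ["17", "17", "17", "17"],
        "pos": [43044304, 43044309, 43044315, 43044320],
        "ref": ["T", "A", "T", "T"],
        "alt": ["G", "G", "A", "C"],
        "score": [-5.10, -3.27, -6.84, -6.19],
    }
)

top = lowest_scoring(toy, score_col="score", n=2)
print(top)
```

With the real file, inspect `df.columns.tolist()` first and pass the actual score column name to `lowest_scoring`.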
