
hg38 reference and annotation files

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/5146235
Description:
This repo contains reference and annotation files for hg38. We are following the [TOPMed pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). Reach out to Arushi Varshney at arushiv AT umich DOT edu if you have any questions.

Files:

1. bwa index = bwa.tar.gz
2. star index = star.tar.gz
3. ENCODE blacklist = blacklist.tar
4. gencode v30 annotations = gencode.tar.gz
5. containers with STAR (RNA) and BWA (ATAC) = containers.tar.gz

Notes on these files:

### hg38 fasta:

I downloaded the TOPMed fasta tar [Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) and use it as-is. The TOPMed GitHub describes how they obtained the Broad Institute's GRCh38 reference, removed the ALT, HLA and Decoy contigs, and added ERCC spike-in reference annotations. Refer to their [README](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for more details. They don't mention PARs, but we checked the reference files and both chrY PARs are hard masked, as [ENCODE](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/) also recommends.

### Gencode v30 gene annotations: gencode.tar.gz

I downloaded the file [gencode.v30.annotation.gtf.gz](https://www.gencodegenes.org/human/release_30.html) from the GENCODE website, and the file [ERCC92.genes.patched.gtf](https://personal.broadinstitute.org/francois/resources/). I then appended the ERCC patched GTF to the GENCODE annotation GTF:

```
# Decompress the GENCODE annotation, then append the ERCC spike-in records
gunzip gencode.v30.annotation.gtf.gz
cat gencode.v30.annotation.gtf ERCC92.genes.patched.gtf > gencode.v30.annotation.ERCC92.gtf
```

### STAR index: star.tar.gz; container with star in containers.tar.gz

A STAR index is shared on the TOPMed GitHub, but it was generated with STAR version STAR_2.6.1d. Since I've been using version 2.7.3a, I followed their steps to regenerate the STAR index, using the GENCODE GTF described above:

```
STAR --runMode genomeGenerate \
  --genomeDir STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_test \
  --genomeFastaFiles Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
  --sjdbGTFfile gencode.v30.annotation.ERCC92.gtf \
  --sjdbOverhang 100 --runThreadN 10
```

### BWA index: bwa.tar.gz

I generated the BWA index using the fasta above:

```
# Symlink the reference under a shorter name, then index it
ln -s Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta hg38.fa
bwa index hg38.fa
```

### ENCODE Blacklist: blacklist.tar

I used the blacklist [here](https://theparkerlab.med.umich.edu/data/arushiv/hg38_references_annots/blacklist/), which I obtained from this [Kundaje website](https://sites.google.com/site/anshulkundaje/projects/blacklists).
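
### Optional sanity check (not part of the original notes)

Because the STAR index is built from a FASTA and a GTF that came from different sources, it can be worth confirming that every contig named in the combined GTF also exists in the FASTA before running `genomeGenerate`. The following is a minimal sketch under stated assumptions, not something the author describes doing: the function name `missing_contigs` is hypothetical, and it assumes bash (for process substitution) and that FASTA contig names are the first whitespace-delimited token of each header line.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print contig names that appear in column 1 of a GTF
# but have no matching '>' header in the FASTA.
# Usage: missing_contigs <reference.fasta> <annotation.gtf>
missing_contigs() {
    local fasta="$1" gtf="$2"
    # comm -13 keeps only the lines unique to the second (GTF) list.
    comm -13 \
        <(grep '^>' "$fasta" | sed 's/^>//' | cut -d' ' -f1 | sort -u) \
        <(grep -v '^#' "$gtf" | cut -f1 | sort -u)
}
```

For example, `missing_contigs Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta gencode.v30.annotation.ERCC92.gtf` prints nothing when every annotated contig is present in the reference; any name it does print is a contig whose annotations have no matching sequence.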
Created:
2021-07-30