Predicting genome-wide tissue-specific enhancers via combinatorial transcription factor genomic occupancy analysis

Source: NIAID Data Ecosystem, indexed 2026-05-02

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.34tmpg4qn

Description:

Background: Enhancers are non-coding cis-regulatory elements that play a vital role in transcriptional regulation. Mutations in enhancers affect gene regulation and can lead to various disease phenotypes. This has increased interest in identifying enhancers and evaluating the impact of mutations on their activity. However, in contrast to protein-coding intervals, enhancers lack a stereotyped sequence composition, so the computational prediction of enhancers and their tissue specificity has remained challenging. Consequently, enhancers are typically predicted from chromatin features, including DNA accessibility, post-translational modifications of histones, and transcription factor (TF) binding. Although these features correlate with enhancer regions, they are imperfect predictors.
Results: The present study reports a sequence-based computational model that uses combinatorial TF genomic occupancy as the principal determinant to predict tissue-specific enhancers. The model was trained on several data sets, including DNA accessibility data from the Encyclopedia of DNA Elements (ENCODE), in vivo experimental data from the VISTA Enhancer Browser, and phylogenetic footprinting of binding motifs. Applying this computational scheme enabled the prediction of 25,000 forebrain-specific cis-regulatory modules (CRMs) in the human genome. The predicted CRMs were validated using ENCODE-based enhancer-associated biochemical features, GWAS-based disease-associated SNPs, and in vivo analysis in zebrafish.
Conclusion: The validation data indicate that this computational model is suitable for predicting less well-conserved tissue-specific enhancer regions that are devoid of characterized chromatin features, and that it can therefore complement and facilitate experimental approaches to tissue-specific enhancer discovery.
Methods: Building on the heterotypic cooperativity of transcription factors (TFs), we present a high-throughput workflow for predicting tissue-specific enhancers at genome-wide scale. The study has two phases. Phase 1 predicts the key TFs likely to have a role in forebrain development. Phase 2 devises a pipeline that predicts tissue-specific enhancers from the heterotypic cooperativity among the TFs selected in phase 1.
In phase 1, we aimed to pinpoint a set of TFs that show combinatorial genomic occupancy and play a significant role in human forebrain development and differentiation. As an initial step, we manually curated a library of 93 TFs relevant to human forebrain tissue through an extensive literature survey. This library was then analyzed with two strategies: (I) motif discovery through statistical over-representation, combined with continuous tag sequence density estimation over a DNase I hypersensitive site (DHS) map; and (II) a regular-expression-based algorithm.
In strategy I, the first step was to computationally predict candidate transcription factor binding sites (TFBSs) for the 93 TFs using two programs, CLOVER and F-Seq. CLOVER screens a set of DNA sequences against a precompiled library of motifs and selects the motifs that are statistically over-represented in the sequences. With CLOVER, the binding profiles of the 93 TFs (collected from JASPAR) were searched against 104 forebrain-specific human enhancers (FSHEs; positive control) from the VISTA Enhancer Browser and a set of 100 human non-coding non-conserved sequences (NCNCSs; negative control) from the UCSC Genome Browser. F-Seq is a software package that generates a continuous tag sequence density estimate to predict binding sites.
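The idea of motif over-representation in a positive set relative to a negative set can be illustrated with a minimal Python sketch. This is not CLOVER's algorithm or statistic: it swaps in a simple one-sided Fisher exact test on the number of sequences containing a consensus match, scans the forward strand only, and uses a toy consensus and toy sequences that are purely hypothetical. All function names are illustrative assumptions.

```python
import re
from math import comb

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string (hypothetical input) into a regex."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))


def count_hits(pattern, sequences):
    """Count sequences with at least one forward-strand match."""
    return sum(1 for seq in sequences if pattern.search(seq.upper()))


def fisher_enrichment_p(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher exact p-value (hypergeometric upper tail).

    Probability of seeing at least `pos_hits` motif-containing sequences
    in the positive set if hits were distributed at random.
    """
    total = pos_total + neg_total
    hits = pos_hits + neg_hits
    upper = min(hits, pos_total)
    tail = sum(
        comb(hits, k) * comb(total - hits, pos_total - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / comb(total, pos_total)


# Toy data (hypothetical): three "enhancer" and three "background" sequences.
positives = ["GGTAATCCAA", "TTTAATCTGG", "ACGTAATCCG"]
negatives = ["GGGGCCCCAA", "ACGTACGTAC", "TTTTGGGGCC"]
pattern = consensus_to_regex("TAATCY")
p_value = fisher_enrichment_p(
    count_hits(pattern, positives), len(positives),
    count_hits(pattern, negatives), len(negatives),
)
print(p_value)
```

In a setting like the one described, the positive and negative sets would be the 104 FSHEs and 100 NCNCSs, and position weight matrices from JASPAR would replace the consensus regex.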
With F-Seq, DNase I hypersensitive site (DHS) data from three independent ENCODE cell lines, GM12878 (lymphocytes), Cerebrum_Frontal_OC (cerebrum), and Frontal_Cortex_OC (cortex), were used to predict DNase I hypersensitive regions of chromatin in the FSHEs (positive control) and NCNCSs (negative control). The binding sites predicted by CLOVER were then overlapped with the regions predicted by F-Seq for more accurate prediction of TFBSs. To identify clusters of co-occurring TFBSs in human forebrain-specific enhancers, we applied principal component analysis (PCA).
In strategy II, we used a phylogenetic footprinting approach to predict the evolutionarily conserved binding sites of the 93 forebrain TFs (curated from the literature) in the FSHEs and NCNCSs. For this purpose, we designed a regular-expression-based algorithm that aligns orthologous human and mouse sequences of the FSHEs and NCNCSs and pinpoints TRANSFAC-based binding sites (for the 93 forebrain TFs) that are conserved between the human and mouse orthologs. The evolutionarily conserved predicted binding sites were then subjected to PCA to identify the clusters of conserved TFBSs in each of the FSHEs and NCNCSs.
The PCA results from strategies I and II were compared, and the 23 TFs common to both were shortlisted for further analysis. These 23 TFs were then used to predict clusters of co-occurring motifs that might serve as forebrain cis-regulatory modules. For this purpose, we selected three human genomic regions enriched in protein-coding genes related to forebrain development or expression, and scanned these segments for the 23 TFs with a purpose-built sliding window brute force search algorithm (SDBFSA).
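The regular-expression footprinting step can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes the human and mouse sequences are already aligned (equal length, gaps written as "-"), slides a window across alignment columns, and reports a site only when both species match the same IUPAC consensus at the same columns. The function names, the consensus, and the toy alignment are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regex."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))


def conserved_sites(human_aln, mouse_aln, consensus):
    """Return sites matching the consensus in both aligned sequences.

    Each result is (human_position, human_site, mouse_site), where
    human_position is the 0-based offset in the ungapped human sequence.
    Windows containing a gap never match, because the character classes
    do not include "-".
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    pattern = consensus_to_regex(consensus)
    width = len(consensus)
    human_aln, mouse_aln = human_aln.upper(), mouse_aln.upper()
    sites = []
    for col in range(len(human_aln) - width + 1):
        human_site = human_aln[col:col + width]
        mouse_site = mouse_aln[col:col + width]
        if pattern.fullmatch(human_site) and pattern.fullmatch(mouse_site):
            # Convert the alignment column to an ungapped human coordinate.
            human_pos = col - human_aln[:col].count("-")
            sites.append((human_pos, human_site, mouse_site))
    return sites


# Toy alignment (hypothetical): the first site is conserved, the second
# matches in human only.
human = "ACGTAATCCGG-TAATCTA"
mouse = "ACGTAATCTGGATAAGCTA"
print(conserved_sites(human, mouse, "TAATCY"))
```

Requiring a match at identical alignment columns is a strict notion of conservation; a more tolerant variant might allow the mouse site to fall within a few columns of the human site.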
In conclusion, we found that a subset of 6 of the 93 TFs, namely FOXP2, OTX1, OTX2, GATA3, HES5 and NGN2, share relatively stronger preferences for heterotypic as well as homotypic binding sites. These six TFs were carried into phase 2, where we used this combinatorial code to predict clusters of their binding sites at genome-wide scale. For this purpose, a specialized Perl script was designed and applied to the repeat- and exon-masked sequence of the human genome (GRCh37). As a result, we identified clusters of TFBSs that were closely spaced, at a spacer distance of 250 bp. Each identified cluster contained binding sites for at least 5 of the 6 distinct TFs. We termed each such cluster an independent cis-regulatory module (CRM) and assigned it a unique identifier called “crm_id”. In this way we compiled a catalogue of 25,000 forebrain-specific CRMs composed of concentrated clusters of recognition motifs for the six core TFs (details are given in the dataset and README file). These predictions were validated using ENCODE-based enhancer-associated biochemical features, GWAS-based disease- or trait-associated SNPs, and in vivo functional analysis in zebrafish. The workflow is suitable for predicting less well-conserved tissue-specific enhancers that are devoid of characterized chromatin features, and it can therefore complement and facilitate experimental approaches to tissue-specific enhancer discovery.
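The clustering rule described above (closely spaced sites, at least 5 of the 6 TFs per cluster) can be sketched in Python. The original work used a Perl script that is not reproduced here; this sketch is only one possible reading. It assumes that the 250 bp "spacer distance" is the maximum gap between adjacent binding sites, that hits are given as (position, TF name) pairs on a single chromosome, and that identifiers take the hypothetical form "crm_1", "crm_2", and so on.

```python
def _emit(group, crms, min_distinct):
    """Append the group as a CRM if it holds enough distinct TFs."""
    tfs = {tf for _, tf in group}
    if len(tfs) >= min_distinct:
        crms.append({
            "crm_id": f"crm_{len(crms) + 1}",  # hypothetical id format
            "start": group[0][0],
            "end": group[-1][0],
            "tfs": sorted(tfs),
        })


def find_crms(hits, max_gap=250, min_distinct=5):
    """Group binding-site hits into candidate cis-regulatory modules.

    hits: iterable of (position, tf_name) on one chromosome.
    Adjacent hits are chained while the gap between them is at most
    max_gap; a chain is kept when it contains at least min_distinct
    different TFs.
    """
    crms = []
    group = []
    for pos, tf in sorted(hits):
        if group and pos - group[-1][0] > max_gap:
            _emit(group, crms, min_distinct)
            group = []
        group.append((pos, tf))
    _emit(group, crms, min_distinct)
    return crms


# Toy hits (hypothetical positions): the first chain spans five distinct
# TFs, the second only three.
hits = [
    (100, "FOXP2"), (220, "OTX1"), (400, "OTX2"), (610, "GATA3"),
    (800, "HES5"), (5000, "FOXP2"), (5100, "OTX1"), (5200, "NGN2"),
]
for crm in find_crms(hits):
    print(crm)
```

Chaining by adjacent gaps lets a module grow beyond 250 bp in total length; if the intended rule is a fixed 250 bp window, the grouping condition would compare each position with the start of the group instead.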

Created:

2024-10-07
