
maxATAC Data

NIAID Data Ecosystem, indexed 2026-03-13
Download link:
https://zenodo.org/record/6761767
Resource description:
Abstract

Transcription factors (TFs) read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods for TFBS prediction from DNase-seq. However, the more recent Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq has surpassed DNase-seq as the most widely used chromatin accessibility profiling method. Furthermore, ATAC-seq is the only such technique available at single-cell resolution from standard commercial platforms. While ATAC-seq datasets grow exponentially, suboptimal motif scanning is unfortunately the most common method for TFBS prediction from ATAC-seq. To enable community access to state-of-the-art TFBS prediction from ATAC-seq, we (1) curated an extensive benchmark dataset (127 TFs) for ATAC-seq model training and (2) built “maxATAC”, a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the first collection of high-performance TFBS prediction models for ATAC-seq.

Repository Overview

This repository contains all of the processed training data used by maxATAC for model training and benchmarking. Each directory is distributed as a .tar.gz archive. The repository contains the following directories:

ATAC_Peaks: ATAC-seq peak files called with MACS2. These files are generated for the hg38 reference genome and have the extension .bed.gz.
ATAC_Signal_File: ATAC-seq signal files. The signal has been read-depth normalized and then min-max normalized to the range 0 to 1, using the 99th percentile value as the maximum. These files are bigwig files with a .bw extension.
ChIP_Binding_File: ChIP-seq signal tracks. These are binary signal tracks, in bigwig format, of the peaks found in the ChIP_Peaks directory.
ChIP_Peaks: ChIP-seq peak files. This directory contains the ENCODE IDR peak sets and the peak sets created in the maxATAC publication. These files have the extension .bed.gz.
Full_Models: The current set of 127 maxATAC TF models. This directory includes the thresholding information and the .h5 model files.
hg38: The hg38 reference genome information that was used in this publication.
Prediction_and_Benchmarking: All of the chr1 predictions used for benchmarking in a round-robin training approach.
Tn5_CutSites: The Tn5 cut sites, shifted +4 bp on the (+) strand and -5 bp on the (-) strand. The cut sites were then extended by 20 bp using bedtools slop. These files are bed files compressed with bzip2. Each file represents an individual biological replicate.
scATAC: Data used for scATAC-seq based predictions.

For additional details please see the maxATAC GitHub Repository and bioRxiv pre-print.
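The two preprocessing steps named in the directory descriptions (the Tn5 cut-site shift with a 20 bp extension, and the 99th-percentile min-max normalization) can be illustrated with a minimal Python sketch. This is not the maxATAC implementation. The function names, the 0-based half-open BED convention, the exact off-by-one placement of the 1 bp cut site on the (-) strand, the nearest-rank percentile, and the clipping of values above the 99th percentile to 1 are all assumptions made for illustration.

```python
import math


def tn5_cut_site_window(start, end, strand, chrom_size, slop=20):
    """Return a (start, end) window around a shifted Tn5 cut site.

    Hypothetical helper: assumes 0-based, half-open BED coordinates and a
    1 bp cut-site interval before extension.
    """
    if strand == "+":
        cut_start = start + 4      # shift +4 bp on the (+) strand
    elif strand == "-":
        cut_start = end - 5        # shift -5 bp on the (-) strand (assumed convention)
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    cut_end = cut_start + 1
    # Extend both sides by `slop` bp and keep the window inside the
    # chromosome, which mirrors what `bedtools slop -b 20` does.
    return max(0, cut_start - slop), min(chrom_size, cut_end + slop)


def minmax_normalize_p99(signal, percentile=99, clip=True):
    """Min-max scale a signal using the given percentile as the maximum.

    Hypothetical helper: uses a nearest-rank percentile over all values
    and, by default, clips anything above the percentile to 1.0.
    """
    values = sorted(signal)
    if not values:
        return []
    low = values[0]
    rank = max(1, math.ceil(percentile * len(values) / 100))
    high = values[rank - 1]
    span = high - low
    if span == 0:
        # Flat signal: nothing to scale, so return zeros.
        return [0.0 for _ in signal]
    scaled = [(x - low) / span for x in signal]
    if clip:
        scaled = [min(1.0, s) for s in scaled]
    return scaled


if __name__ == "__main__":
    print(tn5_cut_site_window(100, 150, "+", chrom_size=1000))
    print(tn5_cut_site_window(100, 150, "-", chrom_size=1000))
    print(minmax_normalize_p99([0.0, 1.0, 2.0, 50.0])[:3])
```

Using a high percentile rather than the absolute maximum keeps a handful of extreme pileups from compressing the rest of the signal toward zero. Whether values above that percentile are clipped or left above 1 in the published files is not stated in the description, so the `clip` flag is exposed here as an option.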

Created: 2022-06-29