Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/3737089
Description:
This dataset was developed for the following article:
Rogan PK, Mucaki EJ and Shirley BC. A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections [version 1; peer review: awaiting peer review]. F1000Research 2020, 9:943 (https://doi.org/10.12688/f1000research.25390.1)
Section 1. Extended Data Tables
This archive contains the extended data tables for the research article "A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections". These tables provide SRSF1, RNPS1 and hnRNP A1 binding site and information-dense cluster counts across various RNA viral genomes (including multiple SARS-CoV-2 and influenza strains) and the human transcriptome; the estimated SARS-CoV-2 doubling time necessary for viral genome SRSF1 binding site availability to exceed sites within the host transcriptome; and an analysis of influenza, dengue, and aplastic anemia patients misdiagnosed as irradiated by established radiation gene signatures. These tables are:
Section 1 - Table 1. RNPS1 and hnRNPA1 binding sites and Information-Dense Clusters for RNPS1 and hnRNPA1 in RNA Virus Genomes
Section 1 - Table 2A. Detailed Analysis of Information-Dense Clusters for SRSF1 (Replicate 1) in RNA Virus Genomes
Section 1 - Table 2B. Detailed Analysis of Information-Dense Clusters for SRSF1 (Replicate 2) in RNA Virus Genomes
Section 1 - Table 2C. Detailed Analysis of Information-Dense Clusters for RNPS1 in RNA Virus Genomes
Section 1 - Table 2D. Detailed Analysis of Information-Dense Clusters for hnRNP A1 in RNA Virus Genomes
Section 1 - Table 3. Binding Site Analysis of Multiple Coronavirus Strains (Both Strands)
Section 1 - Table 4A. Binding Site Analysis of Multiple Influenza A (H3N2) Strains (Negative Strand Only)
Section 1 - Table 4B. Binding Site Analysis of Multiple Influenza A (H3N2) Strains (Both Strands)
Section 1 - Table 5. SRSF1, RNPS1 and hnRNPA1 Binding Sites and Information-Dense Clusters by Gene
Section 1 - Table 6A. Transcriptome-Wide Information Dense Clusters Intersecting DRIP- and DRIPc-seq Intervals
Section 1 - Table 6B. Exome-Wide Information Dense Clusters within DRIP- and DRIPc-seq Intervals
Section 1 - Table 6C. Transcriptome-Wide Scan of Strong Binding Sites Intersecting DRIP- and DRIPc-seq Intervals
Section 1 - Table 6D. Exome-Wide Scan of Strong Binding Sites within DRIP- and DRIPc-seq Intervals
Section 1 - Table 7. Rate of False Positives for Influenza, Dengue Virus and Aplastic Anemia Using Radiation Signatures
Section 1 - Table 8. Radiation Model Genes Contributing to False Positives for Patients with Influenza A, Dengue Virus, and Aplastic Anemia
Section 1 - Table 9A. Doubling Time of SARS-CoV-2 Needed to Exceed Host Transcriptome SRSF1 Binding Sites (Positive-Strand Sites Only)
Section 1 - Table 9B. Doubling Time of SARS-CoV-2 Needed to Exceed Host Transcriptome SRSF1 Binding Sites (Both Strands Considered)
Section 2. All SRSF1, hnRNPA1 and RNPS1 binding site tracks for human and viral genomes
We provide bedgraph tracks giving the location and strength of binding sites (and binding site clusters) for SRSF1, RNPS1 and hnRNPA1 across the human transcriptome (GRCh37), the human exome (including +/-300nt surrounding each exon; non-intergenic only), and all viral genomes investigated in this study (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [two strains]). Note that if no clusters were found for a particular viral genome, no file for that genome is present in the Zenodo archive.
The folder “Cluster-to-DRIPseq-Intersection-Tracks” contains tracks which indicate where binding site clusters have been identified, intersected with DRIP-seq and DRIPc-seq intervals, which indicate where there is evidence of R-loop formation in the human genome. The DRIP-seq dataset (GSE68845) is not strand specific. The DRIPc-seq dataset (GSE70189) is strand specific, and this has been taken into account in the intersection (e.g. tracks only list positive-strand clusters found in positive-strand DRIPc-seq intervals).
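The strand handling described above can be illustrated with a minimal Python sketch. It is not the archived Perl program: the tuple layout, the half-open BED-style coordinates, the use of overlap (rather than full containment) as the test, and the function names are all assumptions made for illustration.

```python
from typing import List, Tuple

# (chromosome, start, end, strand). Half-open, BED-style coordinates are an
# assumption made for this sketch.
Interval = Tuple[str, int, int, str]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect_clusters(
    clusters: List[Interval],
    rloop_intervals: List[Interval],
    strand_specific: bool,
) -> List[Interval]:
    """Keep clusters that overlap at least one R-loop interval.

    strand_specific=True mimics a DRIPc-seq style comparison (strands must
    match); strand_specific=False mimics a DRIP-seq style comparison, where
    the interval strand is ignored.
    """
    kept = []
    for cluster in clusters:
        for interval in rloop_intervals:
            if strand_specific and cluster[3] != interval[3]:
                continue
            if overlaps(cluster, interval):
                kept.append(cluster)
                break  # one overlapping interval is enough
    return kept
```

The nested loop is quadratic and only suitable for small inputs; transcriptome-wide data would normally call for sorted intervals or an interval tree.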
Due to their size, the human transcriptome and exome tracks which indicate the location of individual binding sites are split into two separate files, one per strand. While the custom tracks containing human binding site information are designed to be uploaded to the UCSC Genome Browser, files containing transcriptome-wide binding site information may be too large to upload and may require further filtering (e.g. by chromosome).
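One possible way to filter a large track by chromosome is sketched below in Python. It assumes the standard four-column bedGraph layout (chromosome, start, end, value) with optional "track", "browser" or "#" header lines; the function name is hypothetical and is not part of the archive.

```python
from typing import Iterable, Iterator


def filter_bedgraph_by_chromosome(lines: Iterable[str], chromosome: str) -> Iterator[str]:
    """Yield header lines plus the data lines for a single chromosome."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(("track", "browser", "#")):
            yield line  # keep track definitions so the output remains a valid custom track
            continue
        if stripped.split()[0] == chromosome:
            yield line
```

Because the function works on any iterable of lines, an open file handle can be passed directly and the result written out with `writelines`, so the full track never needs to be held in memory.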
To be classified as a cluster, binding sites on the same strand must have Ri values which sum to >50 bits, each binding site must have a neighboring site within 25nt, and all binding sites in the cluster must have Ri greater than a minimum bit threshold. For human transcriptomes and exomes, this bit minimum was set to Rsequence. The bit minimum for viral binding sites was set to 0.1 * Rsequence. The information density-based clustering algorithm utilized in this work is described in Lu and Rogan 2018 (https://f1000research.com/articles/7-1933/v2) and archived source code is available through Zenodo (https://dx.doi.org/10.5281/zenodo.1892051).
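The three criteria above can be approximated with the following Python sketch. It is a simplified illustration and not the published algorithm of Lu and Rogan: it assumes that sites are given for a single strand as (position, Ri) pairs, that the 25nt rule is measured between site positions, and that the minimum threshold is inclusive. The function names and defaults are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[int, float]               # (position in nt, Ri in bits), one strand only
Cluster = Tuple[int, int, int, float]  # (start, end, number of sites, summed Ri)


def _close_chain(chain: List[Site], min_total_ri: float) -> List[Cluster]:
    """Turn a chain of neighboring sites into a cluster if it qualifies."""
    total = sum(ri for _, ri in chain)
    # At least two sites are needed so that every site has a neighbor.
    if len(chain) >= 2 and total > min_total_ri:
        return [(chain[0][0], chain[-1][0], len(chain), total)]
    return []


def find_clusters(
    sites: List[Site],
    min_ri: float,
    max_gap: int = 25,
    min_total_ri: float = 50.0,
) -> List[Cluster]:
    """Group sites into chains with gaps <= max_gap and keep chains summing to > min_total_ri."""
    # Assumption: the per-site threshold is inclusive (Ri >= min_ri).
    strong = sorted(site for site in sites if site[1] >= min_ri)
    clusters: List[Cluster] = []
    chain: List[Site] = []
    for site in strong:
        if chain and site[0] - chain[-1][0] > max_gap:
            clusters.extend(_close_chain(chain, min_total_ri))
            chain = []
        chain.append(site)
    clusters.extend(_close_chain(chain, min_total_ri))
    return clusters
```

Under the thresholds stated above, `min_ri` would be Rsequence for human data and 0.1 * Rsequence for viral genomes.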
Section 3. Binding site clusters - lollipop plots
Lollipop plots present the genomic coordinates and information densities of clusters across the human transcriptome, human exome, and viral genomes (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [one strain]). The height of each "lollipop" corresponds to the information density of a cluster. Labels above the "lollipops" give the start and end genomic coordinates (GRCh37) of the cluster, followed by the number of sites in the cluster enclosed in brackets. Lollipop plots associated with human transcriptomes/exomes each contain a single gene. Influenza has 8 segments and each segment requires its own plot; the other viral genomes examined are each presented in a single plot.
File naming convention for human plots:
RBP_Gene.png
e.g. RNPS1_ADK.png
File naming convention for viral plots (elements in square brackets do not always appear):
Virus[.InfluenzaSegment].RiThreshold.Strand.RBP.png
e.g. Wuhan-Hu-1.complete-genome.4.2-bits.PosStrand.hnRNPA1.png
The specified Ri threshold indicates that all binding sites which comprise a cluster have Ri greater than or equal to the threshold.
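For batch processing of the viral plots, the naming convention can be parsed with a regular expression, as in the Python sketch below. The pattern is inferred from the single example shown above and assumes the threshold is always written as a number followed by "-bits"; because virus names may themselves contain dots, the virus name and any influenza segment are returned together as one field. The function and field names are hypothetical.

```python
import re
from typing import Dict, Optional

# Virus[.InfluenzaSegment].RiThreshold.Strand.RBP.png
# The virus part is matched lazily so that a threshold such as "4.2-bits"
# is not split at its decimal point.
VIRAL_PLOT_NAME = re.compile(
    r"^(?P<virus>.+?)"
    r"\.(?P<ri_threshold>\d+(?:\.\d+)?)-bits"
    r"\.(?P<strand>[^.]+)"
    r"\.(?P<rbp>[^.]+)"
    r"\.png$"
)


def parse_viral_plot_name(filename: str) -> Optional[Dict[str, object]]:
    """Split a viral lollipop plot file name into its components, or return None."""
    match = VIRAL_PLOT_NAME.match(filename)
    if match is None:
        return None
    fields: Dict[str, object] = dict(match.groupdict())
    fields["ri_threshold"] = float(match.group("ri_threshold"))
    return fields
```

Human plot names such as RNPS1_ADK.png do not follow this pattern and yield None.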
Section 4. Ri(b,l) matrices for all binding sites scanned
This section provides the information theory-based position weight matrices for the RNA binding proteins (RBPs) used in this study: SRSF1, hnRNPA1 and RNPS1. We investigated binding using two different RNPS1 binding models. While similar, these two models contain binding site information on opposite sides of the binding site motif, which is why we found it prudent to scan with both models.
Structure of each file:
Line #1: Start position, End position and Rsequence [average strength of sequences used to generate the model]
Subsequent lines describe the information on each position of the binding site:
First four columns: Ri contribution of nucleotide at this position of the matrix [A, C, G, T]
Column #5: Position of the matrix
Last four columns: Number of binding sites used to generate model with a particular nucleotide at this position of the matrix [A, C, G, T]
Example:
-2.965775 1.282153 0.034225 -4.906891 0 1 19 8 0
At zero position of the matrix (first nucleotide), a ‘C’ would have a positive contribution to binding site strength, a ‘G’ would be relatively neutral, and an ‘A’ or ‘T’ would negatively contribute to binding site strength.
Generation of Ri(b,l) matrices and computation of Ri values can be accomplished using the Delila package (https://alum.mit.edu/www/toms/delila/delilaprograms.html).
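As an illustration of how such a matrix is applied, the Python sketch below reads the file structure described above and computes Ri for a candidate site as the sum of the per-position contributions. It is not a substitute for the Delila programs. It assumes whitespace-separated fields and an integer start and end position; in the example matrix, only the first data row is taken from the example above, while the header line and the second row are invented placeholders.

```python
from typing import Dict, Tuple

BASES = "ACGT"

# Header line and second data row are invented for illustration only.
EXAMPLE_MATRIX = """0 1 5.0
-2.965775 1.282153 0.034225 -4.906891 0 1 19 8 0
1.0 -1.0 0.5 -2.0 1 10 4 8 6
"""


def parse_matrix(text: str) -> Tuple[int, int, float, Dict[int, Dict[str, float]]]:
    """Return (start, end, Rsequence, weights), where weights[position][base] is in bits."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    start, end, rsequence = int(rows[0][0]), int(rows[0][1]), float(rows[0][2])
    weights: Dict[int, Dict[str, float]] = {}
    for fields in rows[1:]:
        position = int(fields[4])  # column 5 holds the matrix position
        weights[position] = dict(zip(BASES, map(float, fields[:4])))
    return start, end, rsequence, weights


def score_site(sequence: str, start: int, weights: Dict[int, Dict[str, float]]) -> float:
    """Sum the contribution of each base; the sequence must span the whole matrix."""
    seq = sequence.upper().replace("U", "T")  # accept RNA input
    if len(seq) != len(weights):
        raise ValueError("sequence length must equal the number of matrix positions")
    return sum(weights[start + offset][base] for offset, base in enumerate(seq))
```

Scanning a genome then amounts to sliding a window of the matrix length along the sequence and recording the score at each offset.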
Section 5. Ri and intersite distance - histograms
Two sets of histograms present Ri distribution and intersite distance distribution across the human transcriptome, human exome, and viral genomes (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [one strain]).
File naming convention for human plots (elements in square brackets do not always appear):
[IntersiteDistancesThreshold-]Human-[DRIPc]-AllChrs-RBP[-RiThreshold].png
e.g. IntersiteDistances500-Human-AllChrs-hnRNPA1-4.6-bits.png
File naming convention for viral plots (elements in square brackets do not always appear):
[IntersiteDistancesThreshold-]Strand-RBP-Virus[.InfluenzaSegment][-RiThreshold].png
e.g. IntersideDistances1000-PosStrandOnly-SRSF1-top50000sitesReplicate1-HIV-1-Strain-B.png
Intersite distance thresholds of 500 or 1000 were assigned for all intersite distance histograms. Any distances above the corresponding threshold were excluded from the plot. Plots presenting Ri distributions contain a dashed line indicating Rsequence if it is visible within the scope of the plot.
Section 6. Perl Scripts and Descriptions
This archive contains all Perl scripts discussed in this archive's associated manuscript and a document file which describes them ("Perl-Script-Descriptions-Page.docx"). The programs and their general functions are as follows:
“ClusterToDRIPseqAnalysisProgram.pl” – reports which information-dense clusters are located within DRIPc- and/or DRIP-seq intervals (individually and by gene)
“ClusterToDRIPseqAnalysisProgram.GeneDensityFinder.pl” – uses the output from script “ClusterToDRIPseqAnalysisProgram.pl” to determine the number and the density of information-dense clusters within a gene (total clusters within the gene and those within DRIPc-seq intervals)
“calculateIntersiteDistance.pl” – determines the distance between all binding sites in the same gene from a list of genomic coordinates
“removeOutliersHigherThanN.pl” – discards intersite distances computed by script “calculateIntersiteDistance.pl” that are greater than a specified threshold
“getStatisticsOnCol.pl” – calculates the count, geometric mean, median, arithmetic mean, and standard deviation of values from the output of script “removeOutliersHigherThanN.pl”
“ScanDataSummaryProgram.pl” – determines the number of binding sites (above a specified Ri threshold) found within known genes (the program also reports the total expression of those genes using external A549 and pneumocyte expression datasets) from binding site coordinate data
“TotalBindingSitePerCellCalculator.pl” – estimates the number of binding sites expressed in a single A549 or pneumocyte cell at any given time.
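The intersite distance workflow (distance calculation, outlier removal, summary statistics) can be sketched in Python as follows. This is an illustrative approximation and not a port of the Perl scripts: it assumes distances are taken between adjacent sites within each gene, that the threshold is inclusive, and that the standard deviation is the sample standard deviation. All function names are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def intersite_distances(sites: Iterable[Tuple[str, int]]) -> List[int]:
    """Distances between adjacent binding sites, computed separately for each gene."""
    by_gene: Dict[str, List[int]] = defaultdict(list)
    for gene, position in sites:
        by_gene[gene].append(position)
    distances: List[int] = []
    for positions in by_gene.values():
        positions.sort()
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def remove_above(values: List[int], threshold: int) -> List[int]:
    """Discard distances greater than the threshold (e.g. 500 or 1000)."""
    return [value for value in values if value <= threshold]


def summarize(values: List[int]) -> Dict[str, float]:
    """Count, geometric mean, median, arithmetic mean and sample standard deviation.

    Requires at least two values, all of them positive.
    """
    return {
        "count": len(values),
        "geometric_mean": statistics.geometric_mean(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }
```

For example, `summarize(remove_above(intersite_distances(sites), 500))` reproduces the three-step chain for a 500nt threshold.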
Created: 2020-12-11



