
A comparative analysis of stably expressed genes across diverse angiosperms exposes flexibility in underlying promoter architecture

Source: NIAID Data Ecosystem, indexed 2026-05-01
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.9w0vt4bmk
Description:
Promoters regulate both the amplitude and pattern of gene expression, key factors in optimizing many synthetic biology applications. Previous work in Arabidopsis found that promoters containing a TATA-box element tend to be expressed only under specific conditions or in particular tissues, while promoters that lack any known promoter elements, and are thus designated Coreless, tend to be expressed more ubiquitously. To test whether this trend represents a conserved promoter design rule, we identified stably expressed genes across multiple angiosperm species using publicly available RNA-seq data. Comparisons between core promoter architectures and gene expression stability revealed differences in core promoter usage between monocots and eudicots. Furthermore, when tracing the evolution of a given promoter across species, we found that core promoter type was not a strong predictor of expression stability. Our analysis suggests that core promoter types are correlative rather than causative in promoter expression patterns and highlights the challenges in finding or building constitutive promoters that will work across diverse plant species.

Methods

RNA-seq dataset processing (Relevant files: 0_Slurm_Pipeline)
RNA-seq atlases were located in the NCBI Sequence Read Archive (SRA) database. The references for the datasets can be found in Supplemental Table S1. The individual datasets were retrieved using the sratoolkit-3.0.1 prefetch function followed by fasterq-dump. Fastqc-0.11.9 was used to generate a QC report for each dataset. Trimmomatic-0.39 was used to trim adapters and low-quality ends with the following settings: ‘SLIDINGWINDOW:4:20 MINLEN:36’. The ILLUMINACLIP adapter file TruSeq3-PE-2.fa was supplied for paired-end data and TruSeq3-SE.fa for single-end data.
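As an illustration of the trimming step, the following is a minimal sketch of how the Trimmomatic arguments could be assembled for paired-end and single-end runs. The jar file name, the output file names, the helper name `trimmomatic_args`, and the ILLUMINACLIP thresholds `2:30:10` (the values used in the Trimmomatic manual's examples) are assumptions; only the adapter files and the ‘SLIDINGWINDOW:4:20 MINLEN:36’ settings come from the description above. The actual pipeline is in 0_Slurm_Pipeline.

```python
from typing import List


def trimmomatic_args(reads: List[str], out_prefix: str) -> List[str]:
    """Build a Trimmomatic argument list for one run (hypothetical helper).

    One input file is treated as single-end, two as paired-end.
    """
    if len(reads) == 2:
        mode = "PE"
        adapters = "TruSeq3-PE-2.fa"
        # Trimmomatic PE order: forward paired, forward unpaired,
        # reverse paired, reverse unpaired.
        outputs = [
            f"{out_prefix}_1P.fastq",
            f"{out_prefix}_1U.fastq",
            f"{out_prefix}_2P.fastq",
            f"{out_prefix}_2U.fastq",
        ]
    elif len(reads) == 1:
        mode = "SE"
        adapters = "TruSeq3-SE.fa"
        outputs = [f"{out_prefix}_trimmed.fastq"]
    else:
        raise ValueError("expected one (SE) or two (PE) FASTQ files")

    return [
        "java", "-jar", "trimmomatic-0.39.jar",  # jar path is an assumption
        mode,
        *reads,
        *outputs,
        f"ILLUMINACLIP:{adapters}:2:30:10",  # 2:30:10 is an assumption
        "SLIDINGWINDOW:4:20",                # from the dataset description
        "MINLEN:36",                         # from the dataset description
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) to execute it on a machine with Trimmomatic.
    print(" ".join(trimmomatic_args(["run_1.fastq", "run_2.fastq"], "run")))
```

Building the command as a list keeps the paired-end and single-end branches explicit and avoids shell quoting issues if it is later passed to `subprocess.run`.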
Reference transcriptomes were downloaded from Ensembl Plants (http://plants.ensembl.org/index.html) for Arabidopsis thaliana, Camelina sativa, Cucumis melo, Glycine max, Phaseolus vulgaris, Pisum sativum, Vigna unguiculata, Sorghum bicolor, Zea mays, Solanum lycopersicum, Actinidia chinensis, and Triticum aestivum, and from Phytozome (https://phytozome-next.jgi.doe.gov) for Arachis hypogaea, Cicer arietinum, and Solanum tuberosum (Cunningham et al., 2021; Goodstein et al., 2012). An index file was generated, and the reads were aligned and counted using Kallisto-0.44.0 with ‘-o counts -b 500’. For single-end data, Kallisto requires the fragment length and standard deviation, but this information is difficult to locate, so default values of ‘-l 200 -s 20’ were used for all single-end datasets. Fastqc was run again on the trimmed files, and MultiQC-1.13 was then run on the entire folder containing all the log files generated by Fastqc, Trimmomatic, and Kallisto. The MultiQC report was inspected to confirm that the trimming step improved read quality and that there were no major warnings.

Normalizing counts, calculating CV and percentile ranking (Relevant files: 1_Metadata_from_RUNselector.Rmd, 2_MOR_Normalization.Rmd)
Using an R script, the raw counts for each species were normalized with the DESeq2 package and a metadata file curated from the original study for each RNA-seq dataset. The coefficient of variation (CV) across all samples in a given atlas was used as the stability metric for each gene, and the percentile ranking of each gene was calculated. The geometric mean for each gene was also calculated across all samples.

Extracting intergenic region and 5’UTR (Relevant files: 3_ExtractPromUTR(ALL_Transcripts).ipynb, 8_ExtractPromUTR(Orthologs).ipynb)
Gff3 annotation files and reference genomes were downloaded from Ensembl or Phytozome, depending on where the reference transcriptomes were retrieved from.
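The stability metric from the normalization step above (CV per gene, then a percentile ranking across genes) can be sketched as follows. This is a minimal illustration in Python rather than the R code in 2_MOR_Normalization.Rmd. The gene names and counts are invented, the function names are hypothetical, and the choices of sample standard deviation and of "percentage of genes with a CV at or below this gene's CV" as the percentile definition are assumptions, not taken from the dataset.

```python
from bisect import bisect_right
from statistics import mean, stdev
from typing import Dict, List


def coefficient_of_variation(values: List[float]) -> float:
    """CV = sample standard deviation / mean (sample SD is an assumption)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a gene with zero mean expression")
    return stdev(values) / m


def percentile_ranks(cv_by_gene: Dict[str, float]) -> Dict[str, float]:
    """Percent of genes whose CV is <= this gene's CV (assumed definition).

    Under this convention a low percentile means a stably expressed gene.
    """
    ordered = sorted(cv_by_gene.values())
    n = len(ordered)
    return {
        gene: 100.0 * bisect_right(ordered, cv) / n
        for gene, cv in cv_by_gene.items()
    }


if __name__ == "__main__":
    # Invented normalized counts across four samples of one atlas.
    normalized = {
        "geneA": [100.0, 110.0, 90.0, 100.0],
        "geneB": [5.0, 200.0, 40.0, 1.0],
        "geneC": [50.0, 60.0, 55.0, 45.0],
    }
    cvs = {g: coefficient_of_variation(v) for g, v in normalized.items()}
    ranks = percentile_ranks(cvs)
    for gene in sorted(ranks, key=ranks.get):
        print(gene, round(cvs[gene], 3), round(ranks[gene], 1))
```

Genes with zero mean expression raise an error here; a real pipeline would have to decide whether to filter them out before ranking.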
40% of transcripts were selected from the total transcriptome, and their intergenic regions and 5’UTRs were extracted using the Gff3 annotation. Intergenic regions and 5’UTRs of identified orthologs were extracted in a similar manner.

Labeling core promoter types (Relevant files: 4_Label_Promoters.Rmd, 9_Motif_Scan.Rmd, 10_Octamer_Scan.ipynb)
Motif Scan: Intergenic regions and 5’UTR sequences were trimmed to the regions scanned for each core promoter type: TATA box (-100 to TSS), Y patch (-100 to +100), and Inr (-10 to +10). Intergenic regions shorter than 100 bp were excluded from analysis. Each region was scanned for its respective motif using the motif files and methods outlined in Jores et al. (2021). A motif was considered present when its relative motif score was above 0.85.
Octamer Scan: Intergenic regions and 5’UTR sequences were trimmed based on the positions relative to the TSS outlined in Yamamoto et al. 2009 (TATA, −45 to −18; Y Patch, −50 to +50; CA, −35 to −1; GA, −35 to +75). Each region was scanned for the presence of octamer motifs from the TATA, Y patch, GA, and CA lists outlined in Yamamoto et al. 2009. If the specified region contained at least one motif for a given promoter type, it was labeled as positive.

Ortholog Analysis (Relevant files: 5_At_gene_ranking.Rmd, 6_Identifying_orthologs.Rmd, 7_Processing_orthologs.Rmd)
The Arabidopsis transcriptome was filtered to include only primary transcripts, and mitochondrial and chloroplast transcripts were removed. The top 5% of stable genes by CV, the bottom 5% of stable genes by CV, and a random set of 1343 genes (5%) were selected. Using biomaRt in R, the Ensembl and Phytozome databases were queried for orthologs of the selected set of Arabidopsis genes in each species (Durinck et al., 2009). Orthologs from Arachis hypogaea, Cicer arietinum, and Solanum tuberosum were retrieved from Phytozome, and those from the remaining species from Ensembl.
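The octamer scan described above comes down to cutting a window around the TSS and checking whether any octamer from the relevant list occurs in it. The sketch below illustrates that logic only. The window coordinates are the ones quoted above from Yamamoto et al. 2009, but the octamers in `PLACEHOLDER_OCTAMERS` are invented placeholders (the real lists are not reproduced here), and the coordinate convention (+1 is the TSS base, −1 the base immediately upstream, no position 0) and all function names are assumptions.

```python
from typing import Dict, Set, Tuple

# Windows relative to the TSS, as quoted from Yamamoto et al. 2009.
WINDOWS: Dict[str, Tuple[int, int]] = {
    "TATA": (-45, -18),
    "Y Patch": (-50, 50),
    "CA": (-35, -1),
    "GA": (-35, 75),
}

# Placeholder octamers for illustration only; NOT the published lists.
PLACEHOLDER_OCTAMERS: Dict[str, Set[str]] = {
    "TATA": {"TATAAATA"},
    "Y Patch": {"CTTCTTCC"},
    "CA": {"CACACACA"},
    "GA": {"GAGAGAGA"},
}


def pos_to_index(pos: int, tss_index: int) -> int:
    """Map a TSS-relative position to a 0-based string index.

    Assumed convention: +1 is the TSS base (at tss_index), -1 is the base
    just upstream, and there is no position 0.
    """
    if pos == 0:
        raise ValueError("position 0 does not exist in this convention")
    return tss_index + pos if pos < 0 else tss_index + pos - 1


def window(seq: str, tss_index: int, start: int, end: int) -> str:
    """Return the inclusive [start, end] window, clipped to the sequence."""
    lo = max(pos_to_index(start, tss_index), 0)
    hi = min(pos_to_index(end, tss_index), len(seq) - 1)
    return seq[lo:hi + 1]


def label_promoter(seq: str, tss_index: int) -> Dict[str, bool]:
    """Label each promoter type positive if any of its octamers is present."""
    seq = seq.upper()
    labels = {}
    for name, (start, end) in WINDOWS.items():
        region = window(seq, tss_index, start, end)
        labels[name] = any(octamer in region for octamer in PLACEHOLDER_OCTAMERS[name])
    return labels


if __name__ == "__main__":
    # 100 bp upstream with a placeholder TATA octamer at -30..-23, 50 bp downstream.
    promoter = "G" * 70 + "TATAAATA" + "G" * 22 + "G" * 50
    print(label_promoter(promoter, tss_index=100))
```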
For the analysis in Figure 3B, significance was tested by ANOVA followed by Tukey’s HSD. For each target gene that matched an Arabidopsis transcript, only the highest-expressing transcript was kept. If an Arabidopsis transcript retrieved more than one ortholog from a target species, these ortholog pairs were removed from the analysis. We only kept orthologous gene groups that had a “change” in expression pattern, defined as crossing the 50th percentile of CV, in two target species. The remaining candidates were manually mapped onto the phylogenetic tree to identify gene groups whose changes in expression pattern were consistent with the tree, meaning that the changes were mostly found in the same clade. Gene trees were built for these candidates using blast-align-tree (https://github.com/steinbrennerlab/blast-align-tree), and the candidate lists were further trimmed based on the gene trees to ensure a 1:1 relationship between all members of the gene group.

The dataset contains all the scripts needed to transform the data as described in the manuscript and to perform the analyses in the paper.
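Two of the ortholog filters described above, dropping Arabidopsis genes with more than one ortholog in a target species and flagging a “change” when the CV percentile crosses 50, can be sketched as follows. This is an illustrative outline, not the code in 7_Processing_orthologs.Rmd. The gene identifiers and percentiles are invented, the function names are hypothetical, and treating a percentile of exactly 50 as being on the upper side is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def keep_one_to_one(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop every Arabidopsis gene that retrieved more than one ortholog.

    `pairs` holds (arabidopsis_gene, target_gene) for ONE target species.
    """
    hits = Counter(at_gene for at_gene, _ in pairs)
    return [(at, target) for at, target in pairs if hits[at] == 1]


def crosses_threshold(at_pct: float, target_pct: float, threshold: float = 50.0) -> bool:
    """True if the two CV percentiles fall on opposite sides of the threshold.

    A value exactly at the threshold counts as the upper side (assumption).
    """
    return (at_pct < threshold) != (target_pct < threshold)


def count_changes(at_pct: float, target_pcts: Dict[str, float]) -> int:
    """Number of target species whose ortholog changed expression pattern."""
    return sum(crosses_threshold(at_pct, pct) for pct in target_pcts.values())


if __name__ == "__main__":
    # Invented example data.
    pairs = [("AT1G00001", "Sp1_g1"), ("AT1G00002", "Sp1_g2"), ("AT1G00002", "Sp1_g3")]
    print(keep_one_to_one(pairs))

    # A stable Arabidopsis gene (10th percentile) and its orthologs' percentiles.
    print(count_changes(10.0, {"species1": 80.0, "species2": 60.0, "species3": 20.0}))
```

A gene group could then be retained when `count_changes` reaches the required number of species. Whether "in two target species" means exactly two or at least two is not stated in the description, so that comparison is left to the caller.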
Created:
2023-09-06