Transposable element products, functions, and regulatory networks in Arabidopsis thaliana

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/10831664

Description:

README

This dataset includes the main outputs from the work titled "Transposable element products, functions, and regulatory networks in Arabidopsis thaliana".

Summary

Transposable elements (TEs) are DNA sequences with the ability to propagate themselves within and across genomes. Their mobilization is catalyzed by self-encoded factors, yet these factors have been poorly investigated due to difficulties in defining TE genes in genomes. Here, we leveraged extensive long- and short-read transcriptome data, together with structural predictions, transcription factor binding site identification, and transcriptional network analyses, to construct a comprehensive atlas of TE transcripts and TE-encoded products in the model organism Arabidopsis thaliana. We uncovered hundreds of transcriptionally competent TEs, each potentially encoding multiple proteins either through distinct genes, alternative splicing, or post-translational processing. Structural-based protein analyses revealed dozens of hitherto unidentified domains of unknown function, enabling us to predict proteins with multimerization and DNA binding domains forming macromolecular complexes involved in transposition. Furthermore, we demonstrate that TE expression is highly intertwined with the transcriptional network of cellular genes, and identified transcription factors and cis-regulatory elements associated with their coordinated expression during development or in response to environmental cues. This comprehensive atlas of TE-genes and TE-proteins provides a valuable resource for studying the mechanisms involved in transposition and their consequences for genome and organismal function.

File description

It includes the following data:

annots/TE_Functional_Annotation.Borreda2024.gtf - Annotation file including Arabidopsis TEs and TE-genes. TE-genes defined in our work are indicated in the 'Source' column of the gtf. TAIR10-defined TEs for which we did not annotate new transcripts are also included.

seqs - This folder includes all the transcript sequences (cDNAs.tsv) and the first and longest ORFs found in each of them (prot.csv), which were used for further analyses. The specific copy, gene, isoform and, in the case of proteins, ORF are indicated for each sequence.

structures - The zipped folder full_length_prots_pdbs.zip includes all the 3D structures of full-length TE proteins. Note that identical proteins, which would result in identical structures, have been collapsed to reduce the total dataset size; equivalences can be found in identical_proteins.

structures/SD_Cluster_Functions.tsv - We clustered all Structural Domains (SDs) based on 3D similarity and assigned a function to each cluster based on the database hits. This table indicates, for each SD, the cluster to which it belongs, the superfamily, family and element containing it, the number of Conserved Domains included within it, and the number of hits with resolved (retrieved from the RCSB-PDB database) or predicted (AlphaFold2) protein structures. The last column gives the putative function assigned to each cluster.

coexpression - Coexpressed genes were classified into modules using WGCNA. The table Gene_Modules.tsv lists, for each gene and TE-gene (provided it is expressed in at least one sample; see the methods in the publication for details), the TE family and superfamily when applicable and the module to which it belongs. The modules were named based on the results of the GO enrichment analysis of the genes they contain. The results of this GO enrichment are included in GO_Enrichment.tsv, which gives the main function of the associated GOs, the number of entries and TE-genes within the module, a list of GO terms enriched in that specific module, and a list of TE families enriched in each module.

dapseq - We reanalyzed the DAP-seq dataset from O'Malley 2016, selecting only TFBS with a binding site within a DAP-seq peak. The list of filtered peaks we found is reported in DAPseq_TFBS_Motifs.tsv. The columns include the coordinates of the TFBS (which have been filtered to fall within a DAP-seq peak and include the TFBS motif), the strand of the motif, the score of the motif reported by FIMO, the motif sequence, the Sequence Read identified for the original DAP-seq data, and the family, name and gene of the TF associated with that specific peak.
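Because the annotation file marks TE-genes through the 'Source' column of the gtf, a quick way to see which labels it contains is to tally records by that column. The following is a minimal sketch, assuming the file follows the standard tab-separated, nine-column GTF layout in which the source is the second field. The function name count_by_source and the labels in the example records are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable

# Illustrative records only: the source labels "label_A" and "label_B" are
# placeholders, not the values actually used in the dataset's annotation file.
EXAMPLE_LINES = [
    "# comment lines are skipped",
    'Chr1\tlabel_A\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";',
    'Chr1\tlabel_A\texon\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";',
    'Chr2\tlabel_B\ttranscript\t50\t400\t.\t-\t.\tgene_id "g2"; transcript_id "g2.1";',
    "",
    "truncated\tline",
]


def count_by_source(lines: Iterable[str]) -> Counter:
    """Count GTF records per value of the second ('source') column."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A well-formed GTF record has nine tab-separated fields.
        if len(fields) < 9:
            continue
        counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    # With the real file, one would instead pass an open file handle, e.g.
    # open("annots/TE_Functional_Annotation.Borreda2024.gtf").
    for source, n in count_by_source(EXAMPLE_LINES).most_common():
        print(source, n)
```

Tallying first, rather than filtering on a hard-coded label, avoids guessing which source value marks the TE-genes defined in this work; once the labels are known, the same loop can be adapted to keep only the records of interest.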

Created:

2024-03-26