Enhenced Araport11 annotation
Source: Figshare. Updated: 2025-10-17. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Enhenced_Araport11_annotation/30383395/1
Description:
The official Arabidopsis thaliana genome annotation, Araport11, provides comprehensive information on genes, noncoding RNAs, and transposable elements (TEs). However, the distributed GFF3 file merges all genomic features (protein-coding genes, noncoding RNAs, and TEs) into a single annotation file without explicit TE classification. The missing classification metadata makes it difficult to extract and analyze specific TE families or subclasses. Consequently, the original format is not optimal for downstream applications such as RNA-seq read alignment, TE expression analysis, or TE–gene association studies.

To address this limitation, TE classification information was incorporated into the Araport11 annotation based on the TAIR Transposon Family Database. Each TE entry in the revised annotation carries a standardized classification tag in the RepeatMasker and EDTA style (classification=subclass/family/clade). The full annotation was also split into multiple feature-specific files (e.g., protein-coding genes, TE genes, and TEs), giving more flexibility for genomic analyses such as gene expression quantification or TE characterization.

All scripts used in this process are provided, and the resulting files are available for download.
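As an illustration of how the classification tag might be used downstream, here is a minimal Python sketch that reads GFF3 lines, keeps TE records, and splits the classification=subclass/family/clade attribute into its three levels. It is not one of the scripts distributed with the dataset. The feature type name transposable_element, the record IDs, the coordinates, and the classification values in the example are all assumptions made for illustration, so check them against the actual files before reuse.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn a GFF3 attribute string (key=value pairs separated by ';') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def te_classifications(gff3_lines, te_type="transposable_element"):
    """Yield (feature ID, classification levels) for each TE record with the tag.

    te_type is an assumed feature type name; adjust it to match column 3
    of the real annotation file.
    """
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != te_type:
            continue  # skip malformed lines and non-TE features
        attrs = parse_attributes(cols[8])
        tag = attrs.get("classification")
        if tag is None:
            continue  # TE record without a classification tag
        yield attrs.get("ID", ""), tuple(tag.split("/"))


# Hypothetical records: IDs, coordinates, and classification values are placeholders.
example = [
    "##gff-version 3",
    "Chr1\tAraport11\tgene\t100\t900\t.\t+\t.\tID=GENE_1",
    "Chr1\tAraport11\ttransposable_element\t1000\t1500\t.\t+\t.\t"
    "ID=TE_A;classification=subclassA/familyX/cladeP",
    "Chr1\tAraport11\ttransposable_element\t2000\t2600\t.\t-\t.\t"
    "ID=TE_B;classification=subclassA/familyY/cladeQ",
]

records = list(te_classifications(example))
# Count TEs per top-level subclass.
counts = Counter(levels[0] for _, levels in records)
```

With a real file, the same generator could be fed an open file handle in place of the example list, and the resulting tuples used to select one family or subclass before expression quantification.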
Provider:
Chen, Heng
Created:
2025-10-17



