Transposable element annotation in non-model species - on the benefits of species specific repeat libraries using semi-automated EDTA and DeepTE de novo pipelines
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.m0cfxpp3h
Resource description:
Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing underestimates of TE abundance. Here, we describe the semi-automated generation of a de novo TE library which combines the newly described EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras sp. C115). We assess performance using both genomic and transcriptomic input by five metrics: (i) abundance, (ii) composition, (iii) fragmentation, (iv) age distributions and (v) capture of potential horizontally transferred TEs. We identified notable differences in these metrics between different TE libraries, and highlight how library choice can have a major impact on TE content estimates in non-model species.
This repository contains six raw (unparsed) RepeatMasker (RM) output files covering two genomes (Corydoras sp. C115 and Corydoras maculifer) and one transcriptome (C. maculifer), together with two repeat libraries (one based on the Repbase Danio rerio library and one de novo library built on the C. sp. C115 genome). The RM output files correspond to one homology-based transposon search using the D. rerio library and one species-specific search using the de novo library. The repository also includes a script to accompany the horizontal transfer analysis and a transposable element renaming script.
Methods
A ‘de novo’ TE library was generated for the C. sp. C115 genome using the Extensive de novo TE Annotator (EDTA) (Ou et al., 2019) set to the ‘others’ species parameter. We utilised the inbuilt RepeatModeler (Smit & Hubley, 2008) support, which identifies any remaining TEs that might have been overlooked by the EDTA algorithm (--sensitive 1). Classifications within this library were refined using DeepTE with the predefined metazoan model parameter setting (-m) (Yan et al., 2020). TE identification was performed using RepeatMasker (RM; version 1.332) with the NCBI/RMBLAST (version 2.6.0+) search engine. This analysis was conducted against either the D. rerio Repbase (2018-10-26) entry, which was also run through DeepTE (to allow for uniformity in TE classification), or the Corydoras-specific library. RM was run under the most sensitive (-s) parameter setting in all instances. The genomic and transcriptomic RM output files were subsequently parsed through a custom R script which (i) removed non-distinct elements by retaining repeats which had a higher scoring match whose domain partly includes the domain of another match, (ii) removed repetitive elements not classed as TEs (e.g. microsatellites, simple repeats and sRNAs), (iii) merged elements found on the same contig if they had the same name and orientation and their combined sequence length was less than or equal to that of the corresponding reference sequence in Repbase, and (iv) removed merged repeats with a length of less than 80 base pairs. Additionally, for transcriptomic data, if multiple identical repeats were found across different transcript isoforms, only one was retained. This was to ensure that each repeat represented a unique genomic locus. This script is publicly available from https://github.com/clbutler/RM_TRIPS.
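Parsing steps (ii) to (iv) above can be illustrated with a minimal Python sketch. It is not the authors' R script: the `Hit` record, the function names, and the `NON_TE_CLASSES` labels are hypothetical, and it assumes that "combined sequence length" means the summed length of the merged fragments and that the 80 bp cutoff applies to that sum. Step (i) and the isoform de-duplication for transcriptomic data are not covered.

```python
from dataclasses import dataclass, replace

# Assumed labels for non-TE repeat classes; adjust to the library in use.
NON_TE_CLASSES = {"Simple_repeat", "Low_complexity", "Satellite",
                  "rRNA", "tRNA", "snRNA", "scRNA", "srpRNA"}


@dataclass
class Hit:
    contig: str
    start: int      # 1-based, inclusive
    end: int
    name: str       # repeat name
    strand: str     # "+" or "C"
    te_class: str   # class/family, e.g. "DNA/hAT"
    ref_len: int    # length of the reference sequence in the library
    bp: int = 0     # summed fragment length, filled in during merging


def drop_non_te(hits):
    """Step (ii): discard repeats that are not classed as TEs."""
    return [h for h in hits if h.te_class.split("/")[0] not in NON_TE_CLASSES]


def merge_fragments(hits):
    """Step (iii): merge same-name, same-strand hits on one contig while the
    combined length does not exceed the reference length."""
    merged = []
    ordered = sorted(hits, key=lambda h: (h.contig, h.name, h.strand, h.start))
    for h in ordered:
        h = replace(h, bp=h.end - h.start + 1)
        prev = merged[-1] if merged else None
        same_element = prev is not None and (
            (prev.contig, prev.name, prev.strand) == (h.contig, h.name, h.strand))
        if same_element and prev.bp + h.bp <= h.ref_len:
            # Extend the previous record and accumulate the fragment length.
            merged[-1] = replace(prev, end=max(prev.end, h.end), bp=prev.bp + h.bp)
        else:
            merged.append(h)
    return merged


def parse_hits(hits, min_len=80):
    """Apply steps (ii) to (iv) in order; step (iv) drops records under min_len."""
    merged = merge_fragments(drop_non_te(hits))
    return [h for h in merged if h.bp >= min_len]
```

Calling `parse_hits` on a list of `Hit` records returns the merged, filtered records; each keeps the outer coordinates of the fragments it absorbed and the summed fragment length in `bp`.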
Additional scripts cover the analysis of horizontal transfer of transposable elements included in the accompanying manuscript.
Created:
2021-03-24



