A pangenome-guided manually curated library of transposable elements for Zymoseptoria tritici

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/8379980

Resource description:

A manually curated TE consensus library generated using a panel of 19 reference genomes for Zymoseptoria tritici [1-3] along with reference genome assemblies for the sister species Z. ardabiliae, Z. brevis, Z. pseudotritici, and Z. passerinii [4].

Methods

Putative TE consensus sequences were first obtained by annotating all 23 genome assemblies [1-4] with Earl Grey with default settings (v3.0; https://github.com/TobyBaril/EarlGrey) [5,6]. Consensus sequences generated from each reference genome were clustered using CD-HIT-EST (v4.8.1) [7,8] to group sequences with 90% similarity across 80% of the longer sequence length (-n 8 -d 0 -aL 0.8 -c 0.90 -G 0 -g 1 -b 500 -r 1) to reduce redundancy whilst preventing the collapsing of chimeric sequences. Consensus sequences <100 bp were removed, as these are unlikely to represent true TE sequences.

Each consensus sequence was then subjected to manual curation as described by Goubert et al. (2022) [9]. Briefly, genomic copies of each TE were obtained using a “BLAST, Extract, Extend” process to recover genomic copies from each of the 23 reference genome assemblies with 1,000 flanking bases at either end [9,10]. For families with >100 BLASTN hits, the 25 longest hits were selected, along with 75 random hits. Multiple alignments were generated for each putative TE family using MAFFT (v7.505) with the --auto flag [11]. Columns composed of >=80% gaps were removed with T-COFFEE (v13.45.0.4846264) [12]. Subsequently, all sequence alignments were manually curated to define TE boundaries and remove regions of low conservation and rare insertions. Following manual curation, new majority-rule consensus sequences were generated with EMBOSS (v6.6.0.0) cons [13]. TE-Aid (https://github.com/clemgoub/TE-Aid/) was used to aid visual inspection and to identify diagnostic features for classification of extended consensus sequences. Following this, TIRs were recorded if present, and nhmmscan (HMMER v3.3.2) [14] was used to identify homology to known curated elements in Dfam (v3.7).
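The hit-subsampling step described above (25 longest hits plus 75 random hits for families with more than 100 BLASTN hits) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `subsample_hits`, the `(sequence_id, start, end)` tuple layout, and the `seed` parameter are hypothetical, and the sketch assumes the 75 random hits are drawn without replacement from the hits remaining after the 25 longest are taken, which the description does not state explicitly.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical hit layout: (sequence id, start, end) on a genome assembly.
Hit = Tuple[str, int, int]


def subsample_hits(
    hits: List[Hit],
    n_longest: int = 25,
    n_random: int = 75,
    max_hits: int = 100,
    seed: Optional[int] = None,
) -> List[Hit]:
    """Keep every hit for small families; otherwise keep the longest
    hits plus a random sample of the rest (assumed: without replacement)."""
    if len(hits) <= max_hits:
        return list(hits)

    # Rank hits by length, longest first.
    ranked = sorted(hits, key=lambda h: abs(h[2] - h[1]), reverse=True)
    longest = ranked[:n_longest]
    rest = ranked[n_longest:]

    # Draw the random hits only from those not already selected.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_random, len(rest)))
    return longest + sampled
```

Passing a fixed `seed` makes the random draw repeatable, which may be useful when the same family is re-extracted during curation.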
Combining this information, each TE consensus sequence was manually classified following the naming convention ‘>ZymTri_2023_family_[n]#[Classification]/[Family]’. Consensus sequences classified with low confidence have a ‘?’ added to the name, as well as the string ‘_LowConf’. To reduce redundancy in the final TE library, sequences were clustered to the family level using the 80-80-80 rule implemented in CD-HIT-EST [9,15] (-d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1). The representative sequence for each cluster was chosen manually as the sequence with the highest classification confidence, also defined as the ‘most intact consensus’. Chimeric sequences that were erroneously clustered were manually separated to retain sequences for the chimeric TE and for the individual elements that generated the chimera.

References

1. Badet, T., Oggenfuss, U., Abraham, L., McDonald, B. A. & Croll, D. A 19-isolate reference-quality global pangenome for the fungal wheat pathogen Zymoseptoria tritici. BMC Biol. 18, 12 (2020).
2. Goodwin, S. B. et al. Finished genome of the fungal wheat pathogen Mycosphaerella graminicola reveals dispensome structure, chromosome plasticity, and stealth pathogenesis. PLoS Genet. 7, e1002070 (2011).
3. Plissonneau, C., Hartmann, F. E. & Croll, D. Pangenome analyses of the wheat pathogen Zymoseptoria tritici reveal the structural basis of a highly plastic eukaryotic genome. BMC Biol. 16, 5 (2018).
4. Feurtey, A. et al. Genome compartmentalization predates species divergence in the plant pathogen genus Zymoseptoria. BMC Genomics 21, 588 (2020).
5. Baril, T., Imrie, R. M. & Hayward, A. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. (2022) doi:10.21203/rs.3.rs-1812599/v1.
6. Baril, T., Galbraith, J. & Hayward, A. Earl Grey. (Zenodo, 2023). doi:10.5281/ZENODO.8116025.
7. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
8. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
9. Goubert, C. et al. A beginner’s guide to manual curation of transposable elements. Mob. DNA 13, 7 (2022).
10. Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinformatics 10, 1–9 (2009).
11. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
12. Notredame, C., Higgins, D. G. & Heringa, J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000).
13. Rice, P., Longden, L. & Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
14. Wheeler, T. J. & Eddy, S. R. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013).
15. Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007).
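For users of the library, the header convention given in the Methods (‘>ZymTri_2023_family_[n]#[Classification]/[Family]’) could be parsed along the following lines. This is a minimal Python sketch under stated assumptions, not part of the deposited dataset: the names `parse_header` and `TEHeader` are hypothetical, and because the description does not say exactly where the ‘?’ and ‘_LowConf’ markers sit within a header, the sketch flags a header as low confidence if either marker appears anywhere in it and leaves the markers in place.

```python
import re
from typing import NamedTuple, Optional

# name '#' classification, optionally followed by '/' family.
HEADER_RE = re.compile(
    r"^>?(?P<name>[^#\s]+)#(?P<classification>[^/\s]+)(?:/(?P<family>\S+))?"
)


class TEHeader(NamedTuple):
    name: str
    classification: str
    family: Optional[str]
    low_confidence: bool


def parse_header(line: str) -> TEHeader:
    """Split a FASTA header of the form name#Classification/Family."""
    text = line.strip()
    match = HEADER_RE.match(text)
    if match is None:
        raise ValueError(f"header does not follow the convention: {text!r}")
    # Assumption: either marker anywhere in the header means low confidence.
    low_confidence = "?" in text or "_LowConf" in text
    return TEHeader(
        match["name"],
        match["classification"],
        match["family"],
        low_confidence,
    )
```

The classification and family strings used to try this out (for example a header such as `>ZymTri_2023_family_1#LTR/Gypsy`) are illustrative only and are not taken from the library itself.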

Created:

2023-09-29