PanTEon Database: A Cross-Kingdom, Automatically Curated Reference for Transposable Elements
Zenodo · updated 2026-04-02 · indexed 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.18039746
Description:
The PanTEon Database is a freely available collection of almost 240,000 automatically curated transposable element (TE) sequences, spanning animals, plants, and fungi and covering all major TE orders. The database was designed to maximize sequence fidelity, taxonomic diversity, and methodological consistency, making it suitable for both training and benchmarking state-of-the-art TE classification tools.
Data sources and integration
PanTEon integrates TE sequences from multiple complementary resources:
Curated sequences from Dfam (version 3.9)
Automatically curated sequences from APTEdb (Pedro et al., 2021)
Uncurated sequences from Dfam
TE sequences from the Ensembl 2025 release (Dyer et al., 2025)
All uncurated sequences were originally generated using RepeatModeler2 (Flynn et al., 2020) and subsequently automatically curated with MCHelper (Orozco-Arias et al., 2024). Only TEs showing clear evidence of structural completeness and expected length profiles were retained, resulting in a high-confidence dataset suitable for training and benchmarking machine learning models.
Sequence identification
Each TE sequence in the PanTEon Database is assigned a standardized identifier composed of:
A sequence name (either provided by Dfam or systematically assigned by the PanTEon framework),
A three-level classification (Class / Order / Superfamily),
The species of origin.
For example, a TE sequence derived from curated Dfam data is identified as:
>PumCon-1.141#CLASSII/TIR/TC1MARINER @Puma concolor
In this case, PumCon-1.141 is the original family name, the element is classified as Class II, order TIR, superfamily Tc1–Mariner, and it was obtained from Puma concolor.
In contrast, a TE sequence that was originally uncurated and subsequently processed and integrated into the PanTEon Database follows the systematic naming scheme:
>PDB00000038#CLASSI/LTR/LARD @Certhia brachydactyla
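The two example identifiers above can be split into their components with a short parser. The following is a minimal sketch, not part of the PanTEon tooling: the names `TEHeader` and `parse_header` are hypothetical, and the regular expression assumes that every header follows the `name#Class/Order/Superfamily @Species` layout shown in the examples, with exactly three classification levels.

```python
import re
from typing import NamedTuple


class TEHeader(NamedTuple):
    """Components of a PanTEon-style FASTA header (hypothetical helper type)."""
    name: str
    te_class: str
    order: str
    superfamily: str
    species: str


# Assumed layout, inferred only from the two examples in the text:
#   >NAME#CLASS/ORDER/SUPERFAMILY @Species name
_HEADER_RE = re.compile(
    r"^>?(?P<name>[^#\s]+)"
    r"#(?P<cls>[^/\s]+)/(?P<order>[^/\s]+)/(?P<superfamily>[^@\s]+)"
    r"\s*@(?P<species>.+)$"
)


def parse_header(line: str) -> TEHeader:
    """Split one header line into name, classification levels, and species."""
    match = _HEADER_RE.match(line.strip())
    if match is None:
        # Headers that deviate from the assumed layout are reported, not guessed.
        raise ValueError(f"unrecognized header: {line!r}")
    return TEHeader(
        name=match["name"],
        te_class=match["cls"],
        order=match["order"],
        superfamily=match["superfamily"],
        species=match["species"].strip(),
    )


if __name__ == "__main__":
    print(parse_header(">PumCon-1.141#CLASSII/TIR/TC1MARINER @Puma concolor"))
    print(parse_header(">PDB00000038#CLASSI/LTR/LARD @Certhia brachydactyla"))
```

Raising an error on non-matching lines, rather than returning partial fields, keeps unexpected header variants visible when the parser is run over the full FASTA file.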
Metadata and taxonomic context
Additional taxonomic information (such as order, family, phylum, and higher ranks) and details about the origin of each sequence are provided in the accompanying metadata file:
PanTEon_Database_metadata_v.1.6.1.csv
Benchmark edition
To enable fair and reproducible benchmarking of state-of-the-art TE classification tools, a benchmark edition of the PanTEon Database was generated. This version includes only TE superfamilies represented by more than 10 sequences and merges rare or uncommon superfamilies with their closest relatives (see the PanTEon paper for full details).
The benchmark dataset is available as:
PanTEon_Database_v1.6.1_benchmark_edition.fasta
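The size criterion described above (keep only superfamilies represented by more than 10 sequences) can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the procedure used to build the benchmark edition: the function names are hypothetical, the classification is taken as the text between `#` and `@` in each header, and the merging of rare superfamilies with their closest relatives is not reproduced because its rules are described only in the PanTEon paper.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def superfamily_of(header: str) -> str:
    """Return the classification string, e.g. 'CLASSI/LTR/LARD' (assumed layout)."""
    return header.split("#", 1)[1].split("@", 1)[0].strip()


def filter_by_superfamily_size(
    records: Iterable[Tuple[str, str]], threshold: int = 10
) -> List[Tuple[str, str]]:
    """Keep records whose superfamily has more than `threshold` sequences."""
    records = list(records)
    counts = Counter(superfamily_of(header) for header, _ in records)
    return [
        (header, seq)
        for header, seq in records
        if counts[superfamily_of(header)] > threshold
    ]
```

A possible usage, assuming the full database has been downloaded to a local FASTA file whose path is supplied by the user, is `filter_by_superfamily_size(read_fasta(open(path)))`. The result will not match the published benchmark file exactly, since the merging step is omitted.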
Trained models for PanTEon Inference
This repository also contains pre-trained models for the built-in architectures of the PanTEon Platform (classification task). These models can be downloaded and used with the PanTEon inference module by passing their paths via the -d parameter. This release includes models trained on the following datasets: all (the entire PanTEon Database v1.6.1), Animalia, Chordata, Arthropods, Plantae, Angiosperms, Fungi, Ascomycota, and Basidiomycota, as well as models that discriminate between TE and non-TE sequences. The PanTEon Platform can be downloaded from the following GitHub repository: https://github.com/simonorozcoarias/PanTEon
Why PanTEon?
By combining broad taxonomic coverage, automated curation, and a standardized nomenclature, the PanTEon Database provides a robust reference resource for:
developing and training machine learning and deep learning models,
benchmarking TE classification tools,
exploring TE diversity across kingdoms.
Funding
Simon Orozco-Arias is supported by a fellowship within the “Generación D” initiative, Red.es, Ministerio para la Transformación Digital y de la Función Pública, for talent attraction (C005/24-ED CV1). Funded by the European Union NextGenerationEU funds, through PRTR.
The Toni Gabaldón group acknowledges support from the Spanish Ministry of Science and Innovation (grant numbers PID2021-126067NB-I00 and PLEC2023-010225), co-funded by ERDF “A way of making Europe”, as well as support from the Gordon and Betty Moore Foundation (grant number GBMF9742); the Catalan Research Agency (AGAUR) (grant numbers 2022 INNOV 00065, 2024 PROD 00175 and 2024 PROD 00043); “La Caixa” foundation (grant numbers LCF/PR/HR21/00737 and CI23-20260); Fundació La Marató de TV3 (202328-31); AECC (PRYGN234923GABA and 290059); Instituto de Salud Carlos III (CIBERINFEC CB21/13/00061-ISCIII-SGEFI/ERDF and DTS25/00141); European Commission, Horizon Europe-HORIZON-MSCA-2023-DN-01-01 (grant number 101168618) and European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement Nº 101226544 (grant number 101227078).
Alexandre R. Paschoal is supported by Fundação Araucária with NAPI Bioinformática (grant number 66.2021) and by the Brazilian National Research Council (CNPq; grant number 440412/2022-6).
Provider:
Zenodo
Created:
2025-12-29



