Datasets associated with the manuscript entitled "CSI-SSU: Phylogenetic contamination screening of genomic datasets, demonstrated on the Protist 10,000 Genomes (P10K) database"
Source: DataCite Commons. Updated: 2026-03-31. Indexed: 2025-09-08.
Download link:
https://figshare.com/articles/dataset/Datasets_associated_with_the_manuscript_entitled_Phylogenetic_placement_and_contamination_screening_of_Amoebozoa_genomic_data_from_the_Protist_10_000_Genomes_P10K_Database_/29814947
Description:
Background: Genomic data are essential for uncovering the evolutionary history, ecological roles, and diversity of life. Yet diverse microbial eukaryotes, predominantly unicellular and traditionally referred to as protists, remain critically underrepresented in genomic repositories, limiting our ability to address fundamental questions in eukaryotic evolution. The Protist 10,000 Genomes (P10K) initiative seeks to fill this gap by generating and compiling genomic and transcriptomic data for a wide range of microbial eukaryotes. However, large-scale sequencing efforts face persistent challenges, including contamination and imprecise taxonomic identification, particularly for poorly studied taxa that require specialized taxonomic expertise. To ensure the reliability of these resources, robust and scalable approaches for taxonomic identification and contamination screening are essential.
Results: We developed CSI-SSU (https://github.com/AlexTiceLab/CSI-SSU), a command-line tool for Contaminant Sequence Investigation (CSI) that uses small subunit ribosomal RNA (SSU) sequences, chimeric sequence detection, and phylogenetic placement to rapidly identify, retrieve, and classify SSU sequences from eukaryotic genomic-level assemblies. CSI-SSU incorporates a curated SSU reference dataset representing the major known eukaryotic supergroups, with sequences and taxonomic nomenclature derived from the Protist Ribosomal Reference (PR2) database. In addition to detecting contaminant sequences, CSI-SSU enables approximate taxonomic assignment of the target lineage in each assembly, with resolution constrained by the current diversity represented in PR2. To further assess potential bacterial contamination, CSI-SSU employs bacterial BUSCO searches as a proxy. We demonstrate the utility and performance of CSI-SSU by screening 2,960 genomic-level assemblies spanning a broad diversity of eukaryotes from P10K. CSI-SSU efficiently detected non-target eukaryotic SSU sequences, revealing cross-group contamination. Classifications also corroborated or refined the original taxonomic assignments, with resolution depending on PR2 representation. Bacterial BUSCO searches indicated bacterial contamination. Independent SSU and COI phylogenies of Amoebozoa supported CSI-SSU classifications, highlighting its accuracy and sensitivity.
Conclusion: CSI-SSU provides a scalable and reproducible framework for phylogenetically informed contamination screening and taxonomic validation of genomic and transcriptomic data. Coupling phylogenetic placement with contamination detection enabled us to distinguish high-quality P10K datasets from those requiring decontamination or additional sequencing before downstream use. These findings serve as a reference for future analyses and guide further sequencing efforts to expand the taxonomic diversity of microbial eukaryotes at the genomic level. Addressing imprecise taxonomic assignments, contamination, and reproducibility in genomic-level datasets will enhance the value of these resources and facilitate studies illuminating the evolution and diversification of eukaryotic life.
Provider:
figshare
Created:
2025-08-02



