
101 flagellate phylogenomics data

DataCite Commons | Updated 2025-06-01 | Indexed 2025-01-06
Download link:
https://figshare.com/articles/dataset/101_flagellate_phylogenomics_data/22148027/2
Description:
Phylogenomics dataset and the transcriptomic data generated for the study of 7 ancyromonads, 14 apusomonads and Meteora sporadica CRO19MET.

Markers and supermatrices: phylogenomics_101_flagellates_97171aa.tar.gz
Raw transcripts and peptides used for phylogenomics: 22_transcriptomes_brut.tar.gz
Transcripts and peptides without cross-contamination due to batch extraction/sequencing: 22_transcriptomes_croco.tar.gz
Peptides without bacterial contamination and redundancy: 22_transcriptomes_eukpep.tar.gz
SRA in BioProject: PRJNA908224

Detailed explanation (read carefully before using these datasets):

The scope of this study was to generate enough conserved phylogenomic markers to resolve the species phylogeny of Apusomonadida and Ancyromonadida in the tree of eukaryotes (with the additional inclusion of the incertae sedis protist Meteora sporadica). To that end, the original sets of de novo assembled transcripts from SPAdes (folder 01_transcripts_brut) were translated to proteins using TransDecoder and CD-HIT at 1% identity (folder 02_peptides_brut), and used to fill the phylogenomic dataset using BLASTp. As explained in the main text, all 22 transcriptomes filled the dataset well (Table S1) and had a high percentage of BUSCO completeness (Table S2), including values higher than that of the reference apusomonad genome of Thecamonas trahens. We do not encourage the use of these raw ("brut") datasets unless all further analyses can be carefully checked on a case-by-case basis. Hence, with the aim of providing good-quality data to the research community, we implemented the decontamination pipeline discussed below. From the original set of de novo assembled transcripts, CroCo detected the most cross-contamination within the 1st sequencing batch (Table S3), which was also the one with the most reads: > 10 million reads, compared to < 8 million reads in the 2nd and 3rd batches (Table S2).
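Before extracting any of the archives named above, it can help to see which folders they contain. The following is a minimal sketch using only the Python standard library. The function name `summarize_archive` is hypothetical, and the internal layout of the archives is not described in this record, so the code simply reports whatever top-level folders it finds.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive_path):
    """Count regular files per top-level folder in a .tar.gz archive.

    Assumption: the archive is gzip-compressed tar. The folder names
    inside it are not assumed; they are read from the archive itself.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other entries
            parts = PurePosixPath(member.name).parts
            # Files stored at the archive root are grouped under ".".
            top = parts[0] if len(parts) > 1 else "."
            counts[top] += 1
    return dict(counts)
```

For example, `summarize_archive("22_transcriptomes_brut.tar.gz")` would return a dictionary mapping each top-level folder to its number of files, assuming the archive has been downloaded to the working directory.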
From the de-cross-contaminated transcripts (folder 03_transcripts_croco), the number of predicted peptides was much larger than the number of transcripts (from 26.19% to 68.81% more), except for Ancyromonas kenti, which had around ten times more transcripts than the other species (Table S2). This is because TransDecoder produces multiple peptides per transcript, some of which might not be real. After removing cross-contamination, the percentage of BUSCO completeness did not decrease for any species. There were some observed differences between taxa, such as apusomonads having more transcripts and peptides than ancyromonads, although it might be irrelevant to scrutinize partial transcriptomic data without genomic data backing up the results. Similarly, the 1st batch provided more transcripts and peptides than the 2nd and 3rd ones, probably because it had more reads to begin with. From that point, we proceeded with only the peptides (folder 04_peptides_croco). Then, the supervised cleaning process with BAUVdb (Bacteria, Archaea, eUkaryotes and Viruses; Table S4) detected a low percentage of eukaryotic peptides: from 6.65% in Ancyromonas kenti up to 17.11% in Fabomonas mesopelagica (folder 05_eukaryotic_peptides). The percentage of BUSCO completeness decreased for the subset with only eukaryotic hits, from only 0.4% in Chelonemonas dolani up to 15.6% in Mylnikovia oxoniensis (the transcriptome with the most peptides). Apusomonas proboscidea, because it was co-sequenced with a stramenopile, had 27% lower BUSCO completeness. On average, completeness decreased by 7.2% after non-eukaryotic contaminants were removed, which might represent a loss of truly eukaryotic peptides due to the limited taxon sampling of BAUVdb (Tables S2 and S4). Regarding the eggNOG-mapper analysis, only about half of the peptides were annotated (55.63% on average), from 48.78% in Mylnikovia oxoniensis up to 62.07% in Chelonemonas dolani.
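Percentages like those above (how many more peptides than transcripts, or what share of peptides is eukaryotic) can be recomputed from sequence counts. The sketch below assumes the files are plain-text FASTA, which this record does not state explicitly. The helper names `count_fasta_records`, `percent_more` and `percent_of` are hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def percent_more(count, reference):
    """Return how much larger `count` is than `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * (count - reference) / reference


def percent_of(subset, total):
    """Return `subset` as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * subset / total
```

With toy numbers that are not taken from the dataset, `percent_more(150, 100)` gives 50.0 (50% more peptides than transcripts) and `percent_of(1, 4)` gives 25.0 (a quarter of the peptides retained).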
Altogether, BUSCO completeness decreased from 4.2% in Chelonemonas geobuk up to 19.4% in Ancyromonas mediterranea. Overall, we encourage anyone to use the subset of eukaryotic peptides for comparative genomics studies in which the proteins under study can be easily checked. Since de novo transcriptomes are prone to show artificially duplicated peptides in comparative genomics analyses, we tested peptide redundancy using CD-HIT at 90% identity. This procedure removed few peptides for most species (6.48% on average), except for the highly duplicated Mylnikovia oxoniensis (~42.1%), as well as Multimonas media (20.51%), Apusomonas australiensis (15.5%) and Cavaliersmithia chaoae (9.57%). These four apusomonad species from the 1st sequencing batch are the ones with the most transcripts and predicted peptides, although they have a number of sequencing reads similar to that of the other species from the batch. As of now, it is not possible to discern whether these differences reflect methodological issues or have a biological meaning, such as genome duplication or high alternative splicing. Interestingly, the BUSCO completeness value remained identical for all species after this step. Although the 255 BUSCO markers are just a small subset of peptides, we suspect that this redundancy reduction did not remove information, but rather errors introduced during data processing. We suggest that users of this data use this set (folder 06_eukpep_cdhit90pid) for high-throughput comparative genomics analyses, always taking into account the information given here. Also, we did not observe any differences in the number of proteins, percentage of BUSCO completeness, or number of eggNOG-annotated peptides between the apusomonad and ancyromonad lineages, between marine and freshwater organisms, or between large and small apusomonads.
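The share of peptides removed by redundancy clustering can be derived from a CD-HIT cluster report, since each cluster keeps one representative. The sketch below assumes the standard `.clstr` text layout that CD-HIT writes (a `>Cluster N` line followed by one line per member). Whether such files are shipped in these archives is not stated, and the helper names are hypothetical.

```python
def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each CD-HIT cluster."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)  # a new cluster begins
        elif sizes:
            sizes[-1] += 1  # a member line of the current cluster
    return sizes


def percent_removed(sizes):
    """Percentage of sequences dropped when one representative is kept
    per cluster."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("no sequences found")
    return 100.0 * (total - len(sizes)) / total
```

Reading a report with `open(path)` and passing the file handle to `cluster_sizes` is enough, because the function only iterates over lines.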
Interestingly, we found that the subset of eukaryote-only peptides reported BUSCO completeness with the bacteria db10 ranging from ~30% in Chelonemonas geobuk up to 50% in Mylnikovia oxoniensis, a value similar to that found for the previously sequenced Thecamonas trahens RefSeq proteins (47.5%). In future studies, it would be interesting to compare these numbers with genomic data and see how well suited RNA-seq is for further comparative genomics analyses.
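To compare completeness between pipeline stages, the one-line BUSCO summary can be parsed directly. This is a minimal sketch that assumes the BUSCO v5-style notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; other BUSCO versions may format the line differently. The function names are hypothetical, and the difference is reported in percentage points, which is only one possible reading of the decreases quoted in this description.

```python
import re

# Assumed BUSCO v5-style one-line summary, for example:
# C:90.0%[S:80.0%,D:10.0%],F:5.0%,M:5.0%,n:255
_BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Extract the C, S, D, F, M percentages and marker count n."""
    match = _BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    values = {key: float(val) for key, val in match.groupdict().items()}
    values["n"] = int(match.group("n"))
    return values


def completeness_drop(before, after):
    """Difference in complete BUSCOs (percentage points) between stages."""
    return before["C"] - after["C"]
```

The parser searches the whole text, so the full contents of a BUSCO short-summary file can be passed in without first isolating the summary line.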
Provider:
figshare
Created:
2024-11-25