Data from: The evolution of silicon transport in eukaryotes

Name: Data from: The evolution of silicon transport in eukaryotes
Creator: figshare
Published: 2020-08-25 13:12:25
License: No description available

DataCite Commons; updated 2020-08-25; indexed 2024-07-28

Download link:

https://figshare.com/articles/Data_from_The_evolution_of_silicon_transport_in_eukaryotes/12410606/1

Description:

Biosilicification (the formation of biological structures from silica) occurs in diverse eukaryotic lineages, plays a major role in global biogeochemical cycles, and has significant biotechnological applications. Silicon (Si) uptake is crucial for biosilicification, yet the evolutionary history of the transporters involved remains poorly known. Recent evidence suggests that the SIT family of Si transporters, initially identified in diatoms, may be widely distributed, with an extended family of related transporters (SIT-Ls) present in some nonsilicified organisms. Here, we identify SITs and SIT-Ls in a range of eukaryotes, including major silicified lineages (radiolarians and chrysophytes) and also bacterial SIT-Ls. Our evidence suggests that the symmetrical 10-transmembrane-domain SIT structure has independently evolved multiple times via duplication and fusion of 5-transmembrane-domain SIT-Ls. We also identify a second gene family, similar to the active Si transporter Lsi2, that is broadly distributed amongst siliceous and nonsiliceous eukaryotes. Our analyses resolve a distinct group of Lsi2-like genes, including plant and diatom Si-responsive genes, and sequences unique to siliceous sponges and choanoflagellates. The SIT/SIT-L and Lsi2 transporter families likely contribute to biosilicification in diverse lineages, indicating an ancient role for Si transport in eukaryotes. We propose that these Si transporters may have arisen initially to prevent Si toxicity in the high Si Precambrian oceans, with subsequent biologically induced reductions in Si concentrations of Phanerozoic seas leading to widespread losses of SIT, SIT-L, and Lsi2-like genes in diverse lineages. Thus, the origin and diversification of two independent Si transporter families both drove and were driven by ancient ocean Si levels.

Removal of cross-contamination in MMETSP data sets

First, simple repeats in the contig files were soft-masked using Dustmasker v1.0.0 (Morgulis et al. 2006) with the default parameters to eliminate spurious hits at a high percentage identity. An “all versus all” sequence comparison was then performed at the nucleotide level with LAST v418 (Kiełbasa et al. 2011) using the option “-e64”, retaining only the top-scoring hit between any pair of contigs.

Instances of cross-contamination between sequencing projects were identified only by comparison of projects involving different species names, or by comparison of different “unknown” strains. In the case of the MMETSP projects, seven pairs of differently named species were apparently identical or near-identical (Alveolata sp. CCMP3155 and Vitrella brassicaformis; Glenodinium foliaceum and Kryptoperidinium foliaceum; Gloeochaete witrockiana and Gloeochaete wittrockiana; Isochrysis galbana and Isochrysis sp. CCMP1324; Symbiodinium sp. CCMP2430 and Symbiodinium sp. D1a; Undescribed NY07348D and Unknown NY0313808BC1; Unidentified sp. CCMP1205 and Unidentified sp. CCMP2175). Cross-matches between these pairs were therefore discarded.

Separate percentage identity distributions were built for hits between each pair of projects. Matching contig pairs were only considered to be hits where hit length was ≥150 nt and/or where the length of the alignment was at least 50% of the length of the shorter of the two contigs. Cross-contaminated hits should form a peak at 100% identity or slightly lower (depending on sequencing and assembly error), and “true” hits between species should be distributed around a lower percentage identity. Thus, if there are any cross-contaminated hits, there should be a local minimum in the distribution between 100% and the “true” average percentage identity between the two species.

For true cross-contamination, the number of hits at 100% identity must be greater than the number at 99% identity. Percentage identity bins are then examined in descending order from 99%, in increments of 1%, until three consecutive bins are found, the lower two of which contain a number of hits greater than or equal to the previous bin. This should represent the threshold between cross-contaminated hits and “true” hits. A threshold was designated as representative of cross-contamination between projects for each percentage identity distribution.

Application of the percentage identity threshold to each pair of projects allowed cross-contaminant contigs within a sequencing project to be marked. An exception was made when the reads per kilobase per million mapped reads (RPKM) in one project was ≥10× the RPKM in the other project of the pair. In this case, the contig in the first project was retained, whereas the contig in the second project was discarded. Alternatively, if the RPKM in the first project was ≥10,000, then this contig was retained. This is because very highly conserved and highly expressed genes are expected to have high percentage identities between species, and in such cases, a 10× RPKM ratio between projects may not be achieved. Contigs above the percentage identity threshold and marked as cross-contaminants were removed from the nucleotide data and also from the predicted protein data. Statistics on the number of contigs removed as cross-contaminants are given in Supplementary Table S5.

References

Keeling PJ, Burki F, Wilcox HM, Allam B, Allen EE, Amaral-Zettler LA, Armbrust EV, Archibald JM, Bharti AK, Bell CJ, et al. 2014. The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing. PLoS Biol. 12:e1001889.

Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. 2011. Adaptive seeds tame genomic sequence comparison. Genome Res. 21:487–493.

Morgulis A, Gertz EM, Schäffer AA, Agarwala R. 2006. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 13:1028–1040.
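
The hit filter, threshold search, and RPKM exception described in the methods above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the input format (hit counts already binned by integer percentage identity), the reading of “and/or” as an inclusive “or”, and the choice to report the highest of the three consecutive bins as the threshold are all assumptions made for the example.

```python
from typing import Mapping, Optional


def is_valid_hit(aln_len: int, len_a: int, len_b: int,
                 min_len: int = 150, min_frac: float = 0.5) -> bool:
    """Keep a contig pair if the alignment is >= 150 nt, or covers at
    least 50% of the shorter contig (assumed inclusive 'or')."""
    return aln_len >= min_len or aln_len >= min_frac * min(len_a, len_b)


def find_threshold(counts: Mapping[int, int]) -> Optional[int]:
    """Search one project pair's identity distribution for a threshold.

    counts maps an integer percentage identity bin (0..100) to the number
    of hits in that bin; missing bins count as zero. Returns None when the
    100% bin does not exceed the 99% bin (no contamination signal).
    """
    def c(p: int) -> int:
        return counts.get(p, 0)

    if c(100) <= c(99):
        return None
    # Descend from 99% in 1% steps until bins p, p-1, p-2 are found where
    # each of the lower two holds at least as many hits as the bin above it.
    for p in range(99, 1, -1):
        if c(p - 1) >= c(p) and c(p - 2) >= c(p - 1):
            return p  # assumption: report the highest of the three bins
    return None


def retain_despite_match(rpkm_this: float, rpkm_other: float,
                         ratio: float = 10.0, high: float = 10_000.0) -> bool:
    """RPKM exception: keep the contig in this project if its RPKM is at
    least 10x that of the matching contig, or at least 10,000 outright."""
    return rpkm_this >= ratio * rpkm_other or rpkm_this >= high


# Hypothetical distribution for one pair of projects (made-up counts).
example_counts = {100: 40, 99: 12, 98: 5, 97: 2, 96: 3, 95: 9, 94: 20}
example_threshold = find_threshold(example_counts)
```

With the made-up counts above, the search stops at the 97% bin, because the 96% and 95% bins each hold at least as many hits as the bin above them. Hits above that threshold would then be treated as candidate cross-contaminants unless `retain_despite_match` applies.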
See also: Dataset 2 at https://dx.doi.org/10.6084/m9.figshare.5686984 (choanoflagellate transcriptomes)

Provider:

figshare

Created:

2020-06-02
