Alignment files and phylogenetic trees for ITS, 18S and PetF sequences of the genus Thalassiosira and other diatoms
Source: DataCite Commons | Updated: 2020-09-22 | Indexed: 2024-07-28
Download link:
https://figshare.com/articles/dataset/Alignment_files_and_phylogenetic_trees_for_ITS_18S_and_PetF_sequences_of_the_genus_Thalassiosira_and_other_diatoms/12783209
Description:
Alignment files and phylogenetic trees for ITS, 18S and PetF sequences of the genus Thalassiosira and other diatoms associated with the manuscript "THE TRANSFER OF THE FERREDOXIN GENE FROM THE CHLOROPLAST TO THE NUCLEAR GENOME IS ANCIENT WITHIN THE PARAPHYLETIC GENUS THALASSIOSIRA" by Roy, Woehle & LaRoche 2020.
File descriptions:
*.aln files represent alignments in FASTA format
*.treefile files represent unrooted phylogenies in Newick format
For sequences of the current study, the abbreviations To (T. oceanica), Tp (T. pseudonana) and Tw (T. weissflogii) are given in the sequence headers, and the numbers that follow give the individual strain identifiers (e.g., 1001 -> CCMP1001). For sequences added from databases, accessions are given in the alignment sequence headers. If a sequence was modified further, a corresponding note is given on its header line in the alignment file. ‘_R_’ in a name marks a sequence whose original orientation was adjusted by the MAFFT alignment tool (option: '--adjustdirection'). In the ‘_notarget’ files representing PetF proteins, the first 48 alignment columns were found to cover the target peptide sequences and were removed for analysis. See Roy, Woehle & LaRoche 2020 (Status: Accepted by the Frontiers in Microbiology journal) for further details.
Methods:
Standard 3730xl Sanger sequencing (Applied Biosystems by Life Technologies™, Carlsbad, CA) of the petF gene, the 18S rRNA and the ITS1-5.8S-ITS2 region was carried out (Supplementary Table S2). Consensus sequences of ITS1-5.8S-ITS2 per strain were obtained using SEQUENCHER 5.4.1 (https://www.genecodes.com/, MI, U.S.A.).
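The header conventions given in the file descriptions above can be illustrated with a minimal Python sketch. The exact header layout in the deposited files is not stated here, so the pattern below (an optional ‘_R_’ prefix, then the species abbreviation, then the strain number) is an assumption, and the names `HEADER_RE` and `decode_header` are hypothetical.

```python
import re

# Species abbreviations as listed in the dataset description.
SPECIES = {
    "To": "Thalassiosira oceanica",
    "Tp": "Thalassiosira pseudonana",
    "Tw": "Thalassiosira weissflogii",
}

# ASSUMPTION: a header of the current study starts with an optional "_R_"
# prefix (sequence reoriented by MAFFT --adjustdirection), followed by the
# species abbreviation and the strain number. The real files may differ.
HEADER_RE = re.compile(r"^(?P<rev>_R_)?(?P<abbr>To|Tp|Tw)[_ ]?(?P<strain>\d+)")


def decode_header(header):
    """Return species, strain and orientation flag, or None if no match."""
    match = HEADER_RE.match(header.lstrip(">"))
    if match is None:
        # Typically a database sequence identified by its accession.
        return None
    return {
        "species": SPECIES[match.group("abbr")],
        "strain": "CCMP" + match.group("strain"),
        "reverse_complemented": match.group("rev") is not None,
    }
```

For example, `decode_header(">To1001")` reports strain `CCMP1001` for Thalassiosira oceanica, while a header that begins with a database accession returns `None`.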
Representatives of the Sanger 18S and petF sequences were determined with an automated approach as follows. The raw Sanger sequencing results for the 18S and petF were mapped to Thalassiosira reference sequences (18S accessions: AAFD02000029.651044.652838, AGNL01025219.4749.6540, FJ600728.1.1764; petF accessions: YP_874492.1, YP_009093409.1, EJK54785.1) via local BLAST (version 2.2.28+, also tested with the more recent version 2.10.1+; ‘blastn -task blastn -evalue 1e-5 -max_target_seqs 1’ and ‘blastx -evalue 1e-5 -max_target_seqs 1’ for the 18S and petF, respectively; Altschul et al., 1997), and the aligned region of the first best hit on each query sequence was extracted. If multiple sequences per species were obtained, they were further combined using the CAP3 assembly tool (Version 02/10/15; Huang and Madan, 1999); the longest resulting sequence exhibiting similarity to one of the reference sequences was picked for phylogenetic reconstructions. In the case of petF, the reference sequences were further used to determine the reading frame for translation into amino acid sequences via the EMBOSS transeq tool (Version 6.6.0.0; Rice et al., 2000). The final sequences for phylogenetics are deposited in NCBI under the following accession numbers: MN809232-MN809243 for the ITS1-5.8S-ITS2, MN807452-MN807463 for the 18S and MN846055-MN846066 for the PetF.
The PetF protein sequences from the genus Thalassiosira and related species were obtained from the NCBI databases (including target peptide sequences; May 2018; Supplementary Figure 1). The downloaded sequences were used as queries for similarity searches to find additional protein homologs of species annotated as Thalassiosirales in the marine microbial eukaryote transcriptome project (MMETSP; Keeling et al., 2014). Best BLAST hits (e-value cut-off 1e-10) were extracted to determine candidate homologs and to define a representative protein sequence set.
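The best-hit extraction step described above for the Sanger reads might look like the following sketch. It assumes the BLAST results were written in the standard 12-column tabular format (query ID in column 1, query start and end in columns 7 and 8); the output format actually used is not stated in this description. The function names are hypothetical, and reverse-complementing a region reported with start greater than end is likewise an assumption about how such hits would be handled.

```python
# Complement table for nucleotide sequences (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def first_best_hits(tabular_lines):
    """Map each query ID to the (qstart, qend) of its first reported hit.

    ASSUMPTION: lines follow the standard 12-column BLAST tabular layout.
    """
    hits = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id = fields[0]
        if query_id not in hits:  # keep only the first hit per query
            hits[query_id] = (int(fields[6]), int(fields[7]))
    return hits


def extract_region(sequence, qstart, qend):
    """Return the aligned query region using 1-based inclusive coordinates.

    If qstart > qend the region is returned reverse-complemented.
    """
    if qstart <= qend:
        return sequence[qstart - 1:qend]
    return sequence[qend - 1:qstart].translate(_COMPLEMENT)[::-1]
```

Used together, `first_best_hits` gives the coordinates for each read and `extract_region` cuts the matching stretch out of the raw read before assembly with CAP3.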
In some cases, amino acids were removed from the beginning or the end of a sequence because multiple sequence alignments (see below) indicated that the N- or C-terminus had likely been predicted incorrectly. We noticed that some protein sequences did not start with a methionine residue although they covered more than the full length of the NCBI Thalassiosira reference sequences (see above). These sequences were trimmed from the start up to the first ‘M’ residue. Further, one ‘X’ residue was trimmed from the C-terminus of a sequence to match the other sequence ends of the same genus (see Supplementary Figure 1 for details on trimmed residues). Two distinct petF homologs found for Minutocellus polymorphus were discarded because we could not clearly determine whether they represented two diversified gene copies or contamination by another diatom species (accessions: CAMPEP_0197733592, CAMPEP_0197725580). All sequence alignments were constructed using MAFFT v7.123b (options: ‘--maxiterate 1000 --localpair’; Katoh and Standley, 2013). First the NCBI homologs were aligned, followed by the gradual addition of homologs derived from MMETSP and from the Sanger sequencing via the ‘--add’ and ‘--addfragments’ alignment options, respectively. Finally, phylogenetic trees were produced from the alignments using IQ-TREE 1.5.5 with 1000 non-parametric bootstrap replicates and the ModelFinder function enabled (Inferred model of substitution: WAG+G4; Nguyen et al., 2015). An alternative phylogeny without target peptides was reconstructed after removal of the first 48 alignment columns, which were found to cover potential target peptide sequences. The 18S sequences for the species set corresponding to the PetF proteins were obtained from the SILVA sequence database (May 2018; Quast et al., 2013) and from sequences provided by the MMETSP project for individual eukaryotic transcriptomes.
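The terminus trimming and the removal of the first 48 alignment columns described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used in the study: the function names are hypothetical, and the alignment is assumed to be held as a plain dictionary mapping sequence names to aligned strings of equal length.

```python
def trim_to_first_met(protein):
    """Drop residues before the first 'M'; return unchanged if there is none."""
    start = protein.find("M")
    return protein if start == -1 else protein[start:]


def strip_trailing_x(protein):
    """Remove a single trailing 'X' residue if present."""
    return protein[:-1] if protein.endswith("X") else protein


def drop_leading_columns(alignment, n_columns=48):
    """Remove the first n_columns columns from every aligned sequence.

    ASSUMPTION: `alignment` is a dict of name -> aligned sequence (gaps
    included), with all sequences of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences are not aligned to equal length")
    return {name: seq[n_columns:] for name, seq in alignment.items()}
```

The two trimming helpers act on unaligned protein sequences before alignment, whereas `drop_leading_columns` acts on the finished alignment, with the default of 48 columns corresponding to the ‘_notarget’ files.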
The 18S of Conticribra weissflogiopsis (Accession: KT347147.1.3358) was additionally included because it was found to be the most closely related to the 18S of T. oceanica CCMP1616 as determined by the SILVA webservice ‘Search and classify’ tool (Min. identity: 0.95; Number of neighbours: 1). Phylogenies for the 18S sequences were reconstructed with MAFFT and IQ-TREE applying the same parameters as for the PetF proteins (Inferred model of substitution: TIM3+F+R2). First, the 18S sequences from databases were aligned with the ‘--maxiterate 1000 --localpair --adjustdirection’ options of MAFFT; the Sanger sequencing results were then added (MAFFT; options: ‘--addfragments’). The ITS1-5.8S-ITS2 phylogeny was produced directly from the consensus sequences (Inferred model of substitution: HKY+F+G4) with Phaeodactylum tricornutum as the outgroup (Accession: EF553458.1).
Provider:
figshare
Created:
2020-08-10



