Multiple sequence alignment and gene-species tree of plant SOK proteins
Source: Figshare. Updated: 2025-07-30. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Multiple_sequence_alignment_of_plant_SOK_proteins/28883615/2
Description:
This entry includes Supplementary Data for Chapter 5 of the PhD thesis titled "Navigating the Plant Cell: The Function and Evolution of SOSEKI Polar Proteins" by Andriy Volkov (Wageningen University). This dataset contains a multiple sequence alignment of 1086 SOSEKI (SOK) proteins from various plants (SOKs.l-ins-i.fa, FASTA format), as well as references for sequence sources (Supplementary methods, Microsoft Word document). A reconciled gene-species phylogenetic tree is also presented (SupplFigure3_SOK_gene-species-tree_reconciliated.xml, doubleRecXML format).
We compiled a set of SOK sequences from various public databases. Most databases have Pfam/InterPro annotations, which allowed us to obtain proteins containing annotated SOSEKI DIX domains (Pfam ID: PF06136). Where no such annotations existed (Metasequoia glyptostroboides, Lupinus angustifolius and Lycopodium clavatum), we queried the genomes using BLASTP to identify sequences highly homologous to the Arabidopsis SOK1 protein. Since genome sequences are available for few bryophyte species, we supplemented our dataset with additional bryophyte SOK sequences from the OneKP dataset (van Dop et al., 2020). We filtered out sequences that did not match a SOK DIX domain using PFAMScan (Madeira et al., 2024; Mistry et al., 2020), as well as sequences shorter than 200 amino acids (median sequence length before filtering: 427 amino acids). The resulting dataset contained 1086 sequences from 199 species.
Sequences were aligned using MAFFT L-INS-i (Katoh et al., 2017). This multiple sequence alignment (MSA) was used to build a phylogenetic tree using FastTree version 2.1.11 (Price et al., 2009, 2010). No trimming of the input sequences was performed. We used a JTT + gamma substitution model, which Smart Model Selection for PhyML (Lefort et al., 2017) identified as optimal. The resulting tree was used as a starting tree for GeneRax to infer a maximum likelihood species-tree-aware phylogeny.
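The length filter mentioned above (dropping sequences shorter than 200 amino acids) can be illustrated with a minimal Python sketch. This is not the authors' script: the function names `parse_fasta`, `ungapped_length` and `filter_by_min_length` are hypothetical, and the sketch assumes plain FASTA input in which `-` and `.` mark alignment gaps, so it works on both unaligned sequences and the aligned file.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ungapped_length(seq: str) -> int:
    """Count residues, ignoring gap characters (assumed to be '-' and '.')."""
    return sum(1 for ch in seq if ch not in "-.")


def filter_by_min_length(
    records: Iterable[Tuple[str, str]], min_len: int = 200
) -> List[Tuple[str, str]]:
    """Keep records with at least min_len residues (200 in the description)."""
    return [(h, s) for h, s in records if ungapped_length(s) >= min_len]
```

To apply it to a file, one could pass an open file handle to `parse_fasta` and the resulting records to `filter_by_min_length`. On the published alignment this would be expected to keep every record, since the filter was applied before alignment.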
We ran GeneRax with the undated gene tree evolution model (UndatedDL) and optimized the tree topology using SPR moves with a maximum radius of 6. We retrieved a species tree from OpenTree Taxonomy version 3.6 (OpenTreeOfLife et al., 2020). There were a few polytomies in the species tree. We resolved these polytomies for downstream use by grouping pairs of branches into artificial clades. Furthermore, we rerooted the species tree to reflect the monophyletic origin of bryophytes, which is currently the consensus topology of land plant phylogeny (Harris et al., 2022; Harris et al., 2020; Su et al., 2021).
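The polytomy resolution described above (grouping pairs of branches into artificial clades) can be sketched as follows. This is an illustrative assumption rather than the procedure actually used: the description does not say which pairs were grouped, so the sketch simply joins the first two children of each multifurcating node until only two remain. Trees are represented as nested tuples with leaf names as strings, and the names `resolve_polytomies` and `is_binary` are hypothetical.

```python
def resolve_polytomies(node):
    """Return a copy of the tree in which no node has more than two children.

    A leaf is a string; an internal node is a tuple of child subtrees.
    Pairing order is arbitrary here (always the first two children).
    """
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child) for child in node]
    while len(children) > 2:
        # Group the first two branches into one artificial clade.
        children = [(children[0], children[1])] + children[2:]
    return tuple(children)


def is_binary(node) -> bool:
    """Check that every internal node has exactly two children."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)
```

For example, `resolve_polytomies(("A", "B", "C", "D"))` returns `((("A", "B"), "C"), "D")`. Because the artificial clades have no biological support, any inference that depends on their internal branching should be interpreted with that in mind.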
Provider:
Volkov, Andriy
Created:
2025-07-30



