RemEff: supplementary material

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://figshare.com/articles/dataset/RemEff_supplementary_material/13325246

Description:

Supplementary material for the manuscript: "Remote homology clustering identifies lowly conserved families of effector proteins in plant-pathogenic fungi".

Figure 1. The clustering workflow employed in this study. A) Sequences are initially clustered using MMSeqs2, resulting in 3,111,468 level 1 clusters. B) A subset of 286,512 of these clusters with any similarity to known effectors is found using HHBlits. C) All sequences from this subset are searched against themselves, and reciprocally significant alignments are selected to form a graph. D) Clusters of the initial clusters are found within the resulting graph to form more distant sequence families. In the final graph, each point represents a level 1 cluster resulting from step A, the colours indicate level 2 clusters (Markov or greedy clustering), and the whole graph forms a single connected component (level 3 cluster).

Figure 2. Top row: plots of the graph coloured by connected components (A), Markov clusters (B), and greedy clusters (C). Bottom row: the number of unique sequences compared with the number of clusters of that size, within Markov (D), greedy (E), and profile clusters (F). For the bottom row, Y-axis values are binned into 100 evenly sized ranges taken from a base-10 exponential space (100..max(#seqs)).

Figure 3. A family of SIX5-like effector sequences. A) The connected component containing the effectors AvrLm6, Bas4, SPD5, and SIX5, coloured by Markov cluster membership (level 2B). B) The same graph, but highlighting the level 1 clusters containing effector sequences and published effector homologues (ALVI*). C) Sequence logos resulting from multiple sequence alignment of all sequences in the connected component (level 3 clusters). Logos for Markov clusters with more than 10 members are shown separately. Columns in the multiple sequence alignment with more than 50% gaps are excluded.

Figure 4. ToxA-like fungal effector groups.
A) The connected components (level 3 clusters) containing ToxA-like and AvrFOM2-like sequences, coloured by Markov cluster membership (level 2B). B) The same graph shown in A, but highlighting level 1 clusters containing known effectors and published effector homologues. C) A multiple sequence alignment constructed from all sequences in the ToxA-like and AvrFOM2-like connected components. Columns in the multiple sequence alignment with more than 50% gaps are excluded. Colours on the y-axis indicate the level 1, 2, and 3 clusters that members belong to, with level 2B (Markov) cluster colours matching those in A.

Figure 5. A connected component containing RNase-like effectors. A single connected component containing the Ribotoxins and RALPH effectors was observed (A). B) A subset of the connected component comprising all level 2 clusters that contain effector sequences (C). D) Sequence logos for each level 2B (Markov) cluster from a multiple sequence alignment of all sequences in (B). Colours in the left boxes correspond to colours in (B). Logos for clusters with fewer than 10 members are not shown. Columns in the MSA with more than 50% gaps are excluded from the visualisation.

Supplementary Table 1. Genomes and proteomes used for clustering in addition to the non-redundant sets from NCBI IPG and UniParc. Genomes collected from the JGI MycoCosm database have a corresponding JGI id.

Supplementary Table 2. Taxonomic summary of the input protein dataset. The sheet “summary” contains the numbers of distinct phyla, classes, orders, families, genera, and species. The sheets “superkingdom”, “phylum”, “class”, and “order” indicate the number of clusters that contain a member of each taxon at the rank given by the sheet name.

Supplementary Table 3. The query sequences and summarised search results used to subset the clusters prior to pairwise comparison.
Sheet “phibase_effector_selection” contains summarised PHI-base entries, which were curated to identify extracellular proteins in the “selected” column. Sheet “custom_effector_sequences” contains details of validated and hypothetical effector sequences collected from the literature, which fill some gaps in the PHI-base dataset. Both the PHI-base sequences and the custom effector sequences were used to search for matches against cluster HMMs. Sheet “match_totals” contains the overall numbers of search matches for the selected PHI-base sequences and the custom effector set searched against the database of cluster HMMs. Sheets “matches_unfiltered” and “matches_filtered” contain summaries of e-values, HMM alignment probabilities, alignment lengths, and sequence identity for the raw matches (unfiltered) and for matches filtered to have a maximum e-value of 1e-5 and a minimum alignment length of 15 residues. The “include” column indicates whether the query sequence should be considered to be an effector (either “selected” in the PHI-base dataset or present in the custom dataset).

Supplementary Table 4. Full list of level 1 clusters, and membership in levels 2a/b and 3.

Supplementary Table 5. Detailed lowest common ancestor summary for clusters at levels 1-3.

Supplementary Table 6. Results of searching effector sequences against all clustered sequences using JackHMMER, including a comparison with clusters obtained with the RemEff pipeline.

Supplementary figure 1. HHBlits alignment score length normalisation. A) Alignment scores show a strong correlation with the product of HMM lengths in a log-log space for the top 10 matches of each query. B) The scores after normalisation show little dependence on the product of HMM lengths. C) The two normalised scores for each pair of a significant match (i.e. A vs B and B vs A) are highly correlated, indicating that the arithmetic mean is a reasonable combination of the scores.
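As an illustration of how reciprocally significant matches can be combined into a graph whose connected components play the role of level 3 clusters, here is a minimal Python sketch. It is not the code from Supplementary data 8. The function names, the input layout (a dictionary keyed by `(query, target)` pairs of level 1 cluster ids holding already-normalised scores), and the toy ids `c1` to `c5` are all assumptions made for this example. The only details taken from the description are that both directions of a match must be present and that the two scores are combined with the arithmetic mean.

```python
from collections import defaultdict


def reciprocal_edges(scores):
    """Keep only pairs matched in both directions; weight = mean of the two scores.

    `scores` maps (query, target) -> normalised score of a significant match.
    This layout is hypothetical and chosen only for the example.
    """
    edges = {}
    for (a, b), score_ab in scores.items():
        if a == b:
            continue  # ignore self matches
        score_ba = scores.get((b, a))
        if score_ba is None:
            continue  # match is not reciprocal
        key = (a, b) if a < b else (b, a)  # undirected edge, stored once
        edges[key] = (score_ab + score_ba) / 2.0
    return edges


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for n in parent:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy input: c2 -> c3 has no reverse match, so it does not become an edge.
toy_scores = {
    ("c1", "c2"): 2.0,
    ("c2", "c1"): 4.0,
    ("c2", "c3"): 1.0,
    ("c4", "c5"): 3.0,
    ("c5", "c4"): 3.0,
}
toy_edges = reciprocal_edges(toy_scores)
toy_components = connected_components(["c1", "c2", "c3", "c4", "c5"], toy_edges)
print(toy_edges)
print(toy_components)
```

The Markov and greedy clustering steps that produce the level 2 clusters are not sketched here; they would operate on the weighted edges within each component.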
Supplementary figure 2. Unfiltered logos for the SIX5-like group.
Supplementary figure 3. Unfiltered logos for the ToxA-like group.
Supplementary figure 4. Unfiltered logos for the RNase-like effector group.
Supplementary figure 5. MAX graph and alignment (unfiltered).

Supplementary data 1. Multiple sequence alignment of the SIX5-like group.
Supplementary data 2. MSA of the ToxA-like group.
Supplementary data 3. MSA of the RNase effector group.
Supplementary data 4. MSA of the MAX effector group.
Supplementary data 5. Logos for all effector clusters.
Supplementary data 6. FASTA-formatted multiple sequence alignments for all effector clusters, used to generate the logos in supplementary data 5.
Supplementary data 7. HMMER3-formatted HMMs for all effector clusters, corresponding to the MSAs in supplementary data 6.
Supplementary data 8. Code used for clustering profile HMM-HMM alignments.
Supplementary data 9. Multiple sequence alignments of additional potential groupings identified using JackHMMER.
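Several figure captions state that alignment columns with more than 50% gaps are excluded before logos are drawn. The sketch below shows one way such a filter could be written in plain Python. It is an illustration only, not the authors' code: the function name `drop_gappy_columns`, its parameters, and the choice to treat both `-` and `.` as gap characters are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Remove columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` is a list of equal-length aligned sequences (strings).
    A column with exactly 50% gaps is kept, because only columns with
    *more than* 50% gaps are excluded.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_rows = len(alignment)
    keep = [
        col
        for col in range(length)
        if sum(row[col] in gap_chars for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Toy alignment: column 2 has 2/4 gaps (kept), column 3 has 3/4 gaps (dropped).
toy_alignment = ["MK-A", "M--A", "MR-A", "M-LA"]
filtered = drop_gappy_columns(toy_alignment)
print(filtered)
```

Applied to the FASTA alignments in supplementary data 6, the sequences would first need to be read into a list of strings; that parsing step is left out here.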

Created:

2020-12-03
