Phylogenomic analyses shed light on the adaptation to aquatic environments in Alismatales
Source: DataCite Commons. Updated 2024-10-10; indexed 2024-11-06.
Download link:
https://figshare.com/articles/dataset/Integrating_transcriptomes_to_investigate_genes_associated_with_adaptation_to_water_environments_and_assess_phylogenetic_conflict_and_whole_genome_duplications_in_Alismatales/16967767/7
Description:
- The folder '1_all_cds_pep' contains the CDS and PEP sequences for the 95 samples from Alismatales and outgroups. (Please contact Lingyun Chen, lychen83@qq.com, for these sequences.)
- '2_MO_alignment_trees' contains the 1005 nuclear orthologs, alignment, concatenated ML tree, and ASTRAL tree.
- '3_chloroplast_alignment_trees' contains chloroplast genes for 92 samples, alignment, concatenated ML tree, and ASTRAL tree.
- '4_3492_extracted_clades' contains the 3492 clusters that were used for the whole genome duplication analyses. The 3492 clusters were mapped to the ASTRAL species tree to count the number of duplicated genes at each node.
- '5_phylogenetic_conflict' contains data related to phylogenetic conflict analyses.
- '6_divergence_time' contains a data matrix for the BEAST analyses and their output:
  - 'alismatales_40genes_1000M_generations_fix_hypothesis1.xml' is the input for BEAST.
  - 'alismatales_40genes_1000M_generations_fix_hypothesis1.10percent.tre' is the summary tree generated with TreeAnnotator. It is also the tree in Supplementary Fig. S4.
- '7_whole_genome_duplication' contains the Ks values used for the Ks plot.
- '8_gene_evolution_ko_analyses' contains the sequence alignments and phylogenetic trees of gene families. It also includes a matrix containing the KEGG information.
  - 'gene_orthologs' contains the alignments and individual gene trees:
    - 'mafft_file_for_tree_build': 20 alignment files for tree building
    - 'ortholog_tree': files of the final gene family trees
  - 'KEGG_orthologs' contains the number of genes annotated with each KEGG ortholog (KO):
    - 'matrix_for_enrichment_test.tsv': gene copy number matrix for all 4687 KOs and 95 species
- '9_Plant_Photos_Figshare' contains the original plant photos.
- 'taxon_list_nov52021' contains the abbreviations and full names of the species.
<br>
The analyses for the folders whose names start with 1, 2, 3, and 4 follow the methods at https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/
<br>
Instructions for the phylogenetic conflict analyses, corresponding to the folder '5_phylogenetic_conflict':
5.1. Quartet Sampling analyses (in folder: /5_phylogenetic_conflict/5.1_fig_s2_alismatales_input_quartet_sampling)
quartet_sampling.py --tree MO_1005_astral_speciestree --align ortholog_MO_1005_concatinated.fa.phy --reps 100 --threads 6 --lnlike 2
<br>
5.2. PhyloNet (in folder: /5_phylogenetic_conflict/5.2_fig3_phylonet)
The folder '5.2_fig3_phylonet' contains two kinds of files. Files ending in '.nex' are the input files, while files ending in 'output' are the output files.
Execute the following command for each input file:
java -jar -Xmx140G /PATH_TO_PHYLONET/PhyloNet_3.8.2.jar .nex > _output
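Running the command once per input file can be scripted. The following is a minimal Python sketch, not the authors' own script. It assumes that each output file takes the name of its input with '.nex' replaced by '_output', and that the jar path and heap size match the command above. The function names `phylonet_jobs` and `run_phylonet` are hypothetical.

```python
import subprocess
from pathlib import Path


def phylonet_jobs(folder, jar="/PATH_TO_PHYLONET/PhyloNet_3.8.2.jar", heap="140G"):
    """Build one (command, output_path) pair per '.nex' file in `folder`."""
    jobs = []
    for nex in sorted(Path(folder).glob("*.nex")):
        # Assumption: output is named after the input, '.nex' -> '_output'.
        out = nex.with_name(nex.stem + "_output")
        cmd = ["java", f"-Xmx{heap}", "-jar", jar, str(nex)]
        jobs.append((cmd, out))
    return jobs


def run_phylonet(folder, **kwargs):
    """Run PhyloNet on every input file, redirecting stdout to the output file."""
    for cmd, out in phylonet_jobs(folder, **kwargs):
        with open(out, "w") as handle:
            subprocess.run(cmd, stdout=handle, check=True)
```

Calling `run_phylonet("5.2_fig3_phylonet")` would process the inputs one after another. With a 140 GB heap per run, sequential execution is probably the safer default.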
<br>
5.3. Consel (in folder: /5_phylogenetic_conflict/5.3_fig4a_all_consel_feb19/estimate_site_wise_log_likelihood_values)
First, estimate the branch lengths (in folder: /5.3_fig4a_all_consel_feb19/estimate_site_wise_log_likelihood_values):
for a in *aln-cln; do raxml -T 10 -f d -s $a -m GTRGAMMA -g alismatales_10species_topology1.tre -n $a"_output_topology1.cons" -p 123456 -N 10 -o atri291; done &
for a in *aln-cln; do raxml -T 10 -f d -s $a -m GTRGAMMA -g alismatales_10species_topology2.tre -n $a"_output_topology2.cons" -p 123456 -N 10 -o atri291; done &
for a in *aln-cln; do raxml -T 10 -f d -s $a -m GTRGAMMA -g alismatales_10species_topology3.tre -n $a"_output_topology3.cons" -p 123456 -N 10 -o atri291; done &
<br>
The above commands generate three tree files ending in '.cons' for each alignment. Combine the three files into one and name it '*_output_3topologies'.
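The combining step can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' procedure. It assumes the trees to combine are RAxML's 'RAxML_bestTree.*' files from the three constrained searches, and that they are concatenated in the order topology 1, 2, 3. The function name `combine_topologies` is hypothetical.

```python
from pathlib import Path


def combine_topologies(alignment, folder="."):
    """Concatenate the three constrained trees for one alignment into one file."""
    folder = Path(folder)
    trees = []
    for k in (1, 2, 3):
        # Assumption: the best tree of each constrained search is the one to keep.
        tree = folder / f"RAxML_bestTree.{alignment}_output_topology{k}.cons"
        trees.append(tree.read_text().strip())
    combined = folder / f"{alignment}_output_3topologies"
    # One Newick tree per line, in topology order 1, 2, 3.
    combined.write_text("\n".join(trees) + "\n")
    return combined
```

Keeping a fixed order matters later, because CONSEL reports results by item number, which follows the order of the trees in this file.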
<br>
Then, estimate site-wise log-likelihood values
for a in *alismatales.fa.aln-cln
do raxml -T 16 -f G -z $a'_output_3topologies' -s $a -r $a'_output_3topologies' -m GTRGAMMA -n $a"_sitelh"
done
<br>
Use the files 'RAxML_perSiteLLs*sitelh' generated from the last step for further analyses.
Rename the files generated in the last step, because the inputs of seqmt and makermt must share the same base name and differ only in their extension. For example, change 'RAxML_perSiteLLs.cluster939_1.ortho.fa_alismatales.fa.aln-cln.sitelh' to 'cluster939_1.ortho.fa_alismatales.fa.aln-cln.sitelh'.
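The renaming can be automated. The sketch below strips the 'RAxML_perSiteLLs.' prefix from every matching file in a folder, as in the example above. The function name `strip_raxml_prefix` is hypothetical, and the refusal to overwrite existing files is an added safety choice rather than part of the original instructions.

```python
from pathlib import Path

PREFIX = "RAxML_perSiteLLs."


def strip_raxml_prefix(folder="."):
    """Rename 'RAxML_perSiteLLs.<name>.sitelh' to '<name>.sitelh' in `folder`."""
    renamed = []
    for path in sorted(Path(folder).glob(PREFIX + "*.sitelh")):
        target = path.with_name(path.name[len(PREFIX):])
        if target.exists():
            # Refuse to overwrite an existing file.
            raise FileExistsError(target)
        path.rename(target)
        renamed.append(target)
    return renamed
```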
<br>
Then execute:
for filename in $(ls *.sitelh); do seqmt --puzzle $filename; done
for filename in $(ls *.aln-cln); do makermt $filename; done
for filename in $(ls *.rmt); do consel $filename; done
for filename in $(ls *.pv); do catpv $filename > $filename.out; done
<br>
Use shell scripts to extract the scores from the files ending in '.out', and generate a file similar to 'Congruent_au_test.csv'.
Execute 'python au.py'. This command displays the results.
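As an alternative to shell scripts, the extraction could be sketched in Python as follows. This sketch assumes the usual catpv table layout, in which each data row starts with '#' and lists rank, item, obs, and au in that order. The CSV columns written here (file, item, au) are hypothetical and may not match the layout of 'Congruent_au_test.csv', so adjust them to whatever au.py expects.

```python
import csv
from pathlib import Path


def parse_catpv(text):
    """Return {item number: AU p-value} from the text of one catpv output file."""
    scores = {}
    for line in text.splitlines():
        fields = line.lstrip("#").replace("|", " ").split()
        # Data rows begin with two integers (rank, item); header rows do not.
        if len(fields) >= 4 and fields[0].isdigit() and fields[1].isdigit():
            scores[int(fields[1])] = float(fields[3])
    return scores


def write_au_table(folder, csv_path):
    """Collect AU values from every '*.out' file into one CSV (hypothetical layout)."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "item", "au"])
        for out in sorted(Path(folder).glob("*.out")):
            for item, au in sorted(parse_catpv(out.read_text()).items()):
                writer.writerow([out.name, item, au])
```

The item number identifies a tree by its position in the combined topology file, so item 1 corresponds to the first tree in that file.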
<br>
5.4. Counting RAxML likelihood scores (in folder: /5_phylogenetic_conflict/5.4_fig4b_likelihood_raxml_output)
Extract the ML scores from the 'RAxML_info.RAxML_bestTree*' files, using the command:
for i in $(ls RAxML_info*);
do
echo $i >> ../all_ln_consel.txt
grep "Tree" $i >> ../all_ln_consel.txt
done
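For readers who prefer Python, the loop above can be mirrored as in the sketch below. It appends each file name followed by every line containing "Tree", exactly as the echo and grep pair does, and makes no further assumption about the format of the RAxML info files. The function name `collect_tree_lines` is hypothetical.

```python
from pathlib import Path


def collect_tree_lines(folder, output):
    """Append each RAxML_info* file name and its lines containing 'Tree' to `output`."""
    with open(output, "a") as handle:  # append, like '>>' in the shell loop
        for info in sorted(Path(folder).glob("RAxML_info*")):
            handle.write(info.name + "\n")
            for line in info.read_text().splitlines():
                if "Tree" in line:  # same case-sensitive match as grep "Tree"
                    handle.write(line + "\n")
```

Calling `collect_tree_lines(".", "../all_ln_consel.txt")` from inside the folder reproduces the file written by the shell loop.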
<br>
Then, use further shell scripts to generate a file with a format similar to 'all_ln_consel.txt'. The script 'extract_raxml_infor_ln.py' can assist with this step.
Execute 'python ln_counts.py'. This command displays the results.
<br>
5.5. Trees used to count support for the three hypotheses of Alismatales (in folder: /5_phylogenetic_conflict/5.5_fig4c_798trees)
<br>
5.6. Output from the polytomy test (in folder: /5_phylogenetic_conflict/5.6_polytomy_test)
<br>
<br>
If you have any questions, please do not hesitate to contact me at lychen83@qq.com.
<br>
Lingyun Chen
<br>
Provider: figshare
Created: 2024-10-09



