
Source Data and datasets for manuscript: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis, submitted to Nature Communications

DataCite Commons. Updated 2024-10-23; indexed 2024-11-06.
Download link:
https://figshare.com/articles/dataset/Datasets_for_manuscript_Phylotranscriptomics_reveals_the_phylogeny_of_Asparagales_and_the_evolution_of_allium_flavor_biosynthesis/25516204/11
Resource description:
<b>These are the Source Data and datasets used in the study "Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis". The major steps and scripts are also included.</b><br>
<b>The file 'Source Data' is the Source Data for this study. Difference between 'Source Data' and the remaining files (called "other files" here): 'Source Data' mainly includes the files used to generate the figures and tables in this study, whereas the "other files" include assembled transcriptomes, CDSs and PEPs, sequences of orthologs, output of PhyloNet, etc. Overall, the "other files" are larger and more comprehensive.</b><br>
'Fig1_assembled_transcriptomes_196trinity_*.7z': de novo transcriptomes assembled from reads generated in this study, used for phylogenetic analyses.<br>
'Fig.4d_assembled_transcriptome.7z': de novo transcriptome assembled from reads generated in this study, used for gene expression level analysis. Please refer to Supplementary Data 1 and Data 8 in 'Supplementary Data 20241005.xlsx' for full species names.<br>
'Fig1_CDS_PEP_compress9.7z' includes CDSs and PEPs for all 501 samples used for phylogenetic analyses of Asparagales. The full species names are listed in Supplementary Data 1. 'Fig1_CDS_PEP_compress9.7z' was generated in steps 1 to 6.<br>
'Fig1_Supp_1_3_4_501taxa_857orthologsgenes.7z' includes sequences for the 857 orthologs obtained using DISCO, the RAxML tree for each ortholog, and the ASTRAL tree inferred from the 857 ortholog trees. 'Fig1_Supp_1_3_4_501taxa_857orthologsgenes.7z' was generated in step 7.<br>
'Supp_1_3_4_501taxa_857concatenatedgenes.7z' includes a concatenated matrix of the 857 orthologs and the RAxML tree inferred from the matrix. 
'Supp_1_3_4_501taxa_857concatenatedgenes' was generated in step 7.<br>
'Fig3_Asparagales_biogeography.tre' and 'Asparagales_distribution_20240206.csv' are the datasets used for biogeographic analyses.<br>
'Supp_Fig5_PhyloNet.7z' contains the input and output files of PhyloNet for the 18 clades of Asparagales. This dataset is for step 9.<br>
'Supp_Fig6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' is the data matrix used for divergence time estimation. Six independent analyses were conducted: this *.xml file was run six times using BEAST, the output of the six runs was combined, and TreeAnnotator was used to summarize divergence times.<br>
'Fig4b_Supp_Fig1-16_CSO_biosynthesisi_pathway.zip' includes sequences of the genes in the CSO biosynthesis pathway and their RAxML trees.<br>
'Supp_Fig16_Gene_ALL_LFS_TreePL.zip' includes files for divergence time estimation of ALL and LFS.<br>
'taxon_list' includes the species abbreviations (first column) and the full species names (second column).<br>
'software_disco.zip' includes the revised version of the software DISCO used to extract ortholog sequences from homolog trees and sequences.<br>
'remove_species_gaps.py' is a custom script that was used to remove gaps in sequences.<br><br>
Steps 1 to 5 adopted scripts from Yang and Smith (2014) and followed their methods with some revisions: Yang, Y. &amp; Smith, S.A. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31, 3081-3092 (2014). https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/.<br>
<b>Step 1: Read processing</b><br>
Transcriptome reads were downloaded from NCBI SRA or generated in this study. Filter out low-quality reads and organelle reads. 
For paired-end reads:<br>
<i>python2 filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Magnoliophyta both num_cores output_dir clean</i><br>
For single-end reads:<br>
<i>python2 filter_fq.py taxonID_1.fq.gz Magnoliophyta both num_cores output_dir clean</i><br>
<b>Step 2: Transcriptome assembly using Trinity (https://github.com/trinityrnaseq/trinityrnaseq)</b><br>
<i>python2 trinity_wrapper.py taxonID_1.overep_filtered.fq.gz taxonID_2.overep_filtered.fq.gz taxonID num_cores max_memory_GB stranded output_dir</i><br>
<b>Step 3: Get the longest transcript in each gene from the Trinity assembly and translate transcripts to CDS and PEP sequences</b><br>
Execute the script <i>get_longest_isoform_seq_per_trinity_gene.pl</i> to get the longest transcripts.<br>
<i>get_longest_isoform_seq_per_trinity_gene.pl taxonID.Trinity.fasta &gt;taxonID.longest_transcripts.fa</i><br>
Translate the longest transcripts in the Trinity assembly to CDS and PEP sequences using TransDecoder. We used PEP sequences of rice, onion, garlic, and Arabidopsis thaliana as references. These genomes were accessed from: https://phytozome.jgi.doe.gov/ (Oryza sativa v7.0), https://figshare.com/search?q=garlic (garlic, dataset posted on 2020-06-24), https://www.oniongenome.wur.nl/ (Onion gene AA sequences v1.2), and Arabidopsis thaliana reference genome TAIR10.1 (https://www.ncbi.nlm.nih.gov/search/all/?term=GCF_000001735.4).<br>
<i>python2 transdecoder_wrapper.py taxonID.longest_transcripts.fa num_cores non-stranded output_dir</i><br>
<b>Step 4: Cluster and extract homologs</b><br>
4.1. Before the homology search, check that the sequence names are formatted correctly and that PEPs and CDSs have matching sequence names. 
Check that there are no duplicated names and no special characters other than digits, letters, and "_", that all names follow the format taxonID@seqID, and that file names are the taxonID.<br>
<i>python2 check_names.py DIR_includes_CDS_PEP file_ending</i><br>
Reduce sequence redundancy.<br>
<i>cd-hit-est -i taxonID.fa.cds -o taxonID.cdhitest -c 0.99 -n 10 -r 0 -T num_cores</i><br>
Combine the CDS sequences of all 501 samples in this study into one file.<br>
<i>cat *.cdhitest &gt;all.fa</i><br>
4.2. Run all-by-all BLASTN for the CDSs of the 501 samples.<br>
<i>makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.fa</i><br>
<i>blastn -db all.fa -query all.fa -evalue 10 -num_threads 20 -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'</i><br>
4.3. Extract homologs for all 501 samples.<br>
<i>python2 blast_to_mcl.py all.rawblast 0.25 &gt;mcl_all_rawblast_out_nohup_out_0.25</i><br>
<i>mcl mcl_all_rawblast_out_nohup_out_0.25 --abc -te 80 -tf 'gq(5)' -I 1.5 -o hit-frac0.25_I1.5_e5</i><br>
<i>python2 write_fasta_files_from_mcl.py all.fa hit-frac0.25_I1.5_e5 minimal_taxa outDIR</i><br>
<b>Step 5: Build a maximum likelihood tree for each homolog group</b><br>
<i>python2 fasta_to_tree_pxclsq.py fasta_dir number_cores dna bootstrap(y)</i><br>
<b>Step 6: Extract ortholog groups</b><br>
Instead of the 1to1, MI, RT, or MO methods in Yang and Smith (2014), we used a revised version of DISCO (https://github.com/JSdoubleL/DISCO) to infer orthologs. The revised DISCO is named "<i>ca_disco</i>" and is available on this page.<br>
6.1. 
Convert the aligned homolog groups from fasta format to phylip format using <i>phylpwrtr.py</i> (this script is deposited on this page).<br>
<i>python /path_DISCO/phylpwrtr.py input.fa output.fa max_number_allowed_in_sequence_head</i><br>
<i>python ca_disco.py -i a_text_file_includes_all_homolog_trees -o output.phy -a a_text_file_includes_the_path_of_aligned_homologs -t a_list_file_includes_names_of_the_501_samples -m min_number_of_taxon_allowed_for_each_ortholog</i><br>
a_text_file_includes_all_homolog_trees: each line of this file contains one homolog tree. We set the minimum number of taxa to 350.<br>
6.2. Extract sequences of orthologs.<br>
Convert output.phy from phylip format to fasta format (named "output.fasta") using <i>phylpwrtr.py</i>. The file "output.fasta" contains the concatenated sequences of the 857 extracted orthologs. When <i>ca_disco.py</i> is executed, the script prints partition information on screen. Save this information as "partition.txt".<br>
Next, split the concatenated sequences into the 857 individual orthologs using AMAS (https://github.com/marekborowiec/AMAS).<br>
<i>python AMAS.py split -f fasta -d dna -i output.fasta -l partition.txt -u fasta</i><br>
Remove the gaps in the sequences of the 857 orthologs using our script <i>remove_species_gaps.py</i>.<br>
<i>python2 remove_species_gaps.py input_alignment.fasta output_alignment.fasta</i><br>
<b>Step 7: Build the species tree (Supplementary Fig. 1 and 3 and Fig. 1a) and the concatenated tree (Supplementary Fig. 4) of Asparagales</b><br>
7.1. Build a species tree using ASTRAL.<br>
<i>python2 fasta_to_tree_pxclsq.py folder_include_the_857_ortholog_groups number_of_cores dna y</i><br>
<i>java -jar /path_to_ASTRAL/astral.5.7.8.jar -i a_text_file_includes_the_857_RAxML_tree -o ASTRAL_output.tree</i><br>
7.2. 
Build a concatenated tree using RAxML.<br>
Put the aligned sequences of the 857 orthologs (file names ending with aln-cln) into a folder.<br>
<i>python2 concatenate_matrices_phyx.py folder_include_the_857_aln-cln_file 100 30 output_concatenated_file</i><br>
<i>raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 12345 -# 100 -p 12345 -s name_of_the_concatenated_857_ortholog_sequences -m GTRCAT -n name_of_the_concatenated_857_ortholog_sequences</i><br>
Change the species abbreviations to full names in the phylogenetic trees.<br>
<i>python2 taxon_name_subst.py taxon_list name_of_phylogenetic_tree</i><br>
<b>Step 8: Assess phylogenetic conflicts</b><br>
8.1. Calculate the concordant/conflicting bipartitions and internode certainty all (ICA) using PhyParts (Supplementary Fig. 3).<br>
First, root the 857 ortholog trees using the script root_trees_multiple_outgroups_MO.py, accessed from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/scripts/<br>
<i>python2 root_trees_multiple_outgroups_MO.py DIR_includes_the_857_ortholog_trees tree_file_ending output_DIR a_list_includes_names_of_outgroups</i><br>
Next, execute PhyParts.<br>
<i>java -Xmx40g -jar path_to_phyparts_folder/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d dir_includes_the_857_rooted_trees -m name_of_the_Asparagales_species_tree -o output_name -s 50 -v</i><br>
8.2. Execute Quartet Sampling (https://github.com/FePhyFoFum/quartetsampling) using the concatenated matrix of the 857 orthologs and the concatenated tree (Supplementary Fig. 4).<br>
<i>python3 /path_to_quartet_sampling/quartet_sampling.py --tree RAxML_tree_of_the_857_concatenated_orthologs --align phylip_format_of_the_857_concatenated_ortholog_sequences.phy --reps 100 --threads 60 --lnlike 2</i><br>
We used species abbreviations instead of full species names during the analyses. Execute the following command to replace the abbreviations with full species names.<br>
<i>python taxon_name_subst.py *.tre</i><br>
<b>Step 9: PhyloNet (Supplementary Fig. 5)</b><br>
In Linux, type the following command line to run PhyloNet_3.8.2.jar. 
The input file names end with .nex. Each clade has five independent runs. The output file names end with .output. For example:<br>
<i>nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet1.nex &gt;asparagales_phylonet1.output</i><br>
<i>nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet2.nex &gt;asparagales_phylonet2.output</i><br>
...<br>
"asparagales_phylonet1.output" and "asparagales_phylonet2.output" are the output of PhyloNet.<br>
<b>Step 10: Divergence time estimation using BEAST2 (Supplementary Fig. 6)</b><br>
Execute the file 'Supp_Fig6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' using BEAST2 at https://www.phylo.org/. We executed six independent runs and combined the output. The first 10% of trees were discarded as burn-in, and the remaining trees were used to generate a summary tree with TreeAnnotator v.2.6.3.<br>
<b>Step 11: Biogeographic analysis using RASP (https://github.com/sculab/RASP; Fig. 3 and Supplementary Fig. 7)</b><br>
Ancestral areas were reconstructed using BioGeoBEARS as implemented in RASP. 'Fig3_Asparagales_biogeography.tre' and 'Asparagales_distribution_20240206.csv' are the datasets used for biogeographic analyses.<br>
<b>Step 12: The divergence time of ALL and LFS (Supplementary Fig. 16)</b><br>
Execute treePL (https://github.com/blackrim/treePL) to get dated trees.<br>
<i>treePL ALL_TreePL_config.txt</i><br>
<i>treePL LFS_TreePL_config.txt</i><br>
Extract the ages from the two trees separately, naming the files 'ALL_date_extracted_20240206.txt' and 'LFS_date_extracted_20240206.txt'. Then use the script 'boxplot with jittered points.r' to plot the distribution.<br>
<b>Step 13: Expression level for 13 genes in allium flavor pathway (Fig. 4d and Supplementary Fig. 17)</b><br>
13.1. Filter adapters and low-quality bases using fastp (https://github.com/OpenGene/fastp).<br>
<i>fastp -i taxonname_R1.fq.gz -o taxonname_R1.output.fq -I taxonname_R2.fq.gz -O taxonname_R2.output.fq</i><br>
13.2. 
Assemble the transcriptome using Trinity.<br>
<i>Trinity --seqType fq --samples_file a_text_file_includes_names_of_input_fastq_files --CPU 10 --max_memory 100G --output taxon_Trinity</i><br>
13.3. Get the longest transcript in each gene of the Trinity assembly using the script get_longest_isoform_seq_per_trinity_gene.pl in the Trinity package.<br>
<i>./path_to_Trinity/get_longest_isoform_seq_per_trinity_gene.pl taxon_Trinity.fasta &gt;taxon.longest_cluster_transcripts.fa</i><br>
13.4. Translate to CDS and PEP files using TransDecoder (https://github.com/TransDecoder/TransDecoder).<br>
<i>python2 transdecoder_wrapper.py taxon.longest_cluster_transcripts.fa 5 non-stranded taxon output_dir_name</i><br>
13.5. Build an index using Salmon.<br>
<i>salmon index -t CDS_obtained_from_Transdecoder -i output_index_name</i><br>
13.6. Map sequence reads to the Salmon index.<br>
<i>salmon quant -i name_index_file_from_last_step -l A -1 taxon_reads.1.fq -2 taxon_reads.2.fq -o output_name</i><br>
The Salmon output includes a file named "quant.sf", which contains the expression level of each gene obtained from TransDecoder. The TPM is in the 4th column of "quant.sf".<br>
13.7. Extract the TPM.<br>
After executing Salmon, each species has three quant.sf files, named quant1.sf, quant2.sf, and quant3.sf. Extract the TPM for the three replicates of each species using shell commands. The 1st column of quant.sf is the sequence name and the 4th column is the TPM.<br>
<i>paste &lt;(awk '{print $1 " " $4}' quant1.sf) &lt;(awk '{print $1 " " $4}' quant2.sf) &lt;(awk '{print $1 " " $4}' quant3.sf) &gt; taxonmerged_data.txt</i><br>
13.8. Average the TPM of the three replicates and save it in the 7th column. 
The 1st column contains the gene names, while the 2nd, 4th, and 6th columns contain the TPM.<br>
<i>awk '{print $0, ($2+$4+$6)/3}' taxonmerged_data.txt &gt; taxonmerged_data_with_average.txt</i><br>
<i>awk '{print $1, $7}' taxonmerged_data_with_average.txt &gt; taxonextracted_columns.txt</i><br>
<i>while IFS= read -r line; do grep "$line" taxonextracted_columns.txt; done &lt; taxon.fa &gt; taxon.tpm</i><br>
"taxon.fa" lists the sequence names that belong to one gene in one species. For example, the "taxon.fa" for gene CAT in species "alltai" includes the sequences "alltai@DN105953c0g1i1", "alltai@DN6385c0g1i14", "alltai@DN42476c0g1i6", "alltai@DN51139c2g1i4", and "alltai@DN51139c1g1i3". We extracted the TPM of these five sequences and saved it in "taxon.tpm". We obtained "taxon.fa" by checking the phylogenetic trees of the 13 genes (Supplementary Figs. 11–21) in allium flavor biosynthesis.<br>
13.9. Average the TPM of the sequences for each gene in each species.<br>
<i>awk '{ sum += $2 } END { if (NR &gt; 0) print sum / NR }' taxon.tpm</i><br>
13.10. Sum the TPM of the sequences for each gene in each species.<br>
<i>awk '{sum += $2} END {print sum}' taxon.tpm</i><br>
Then, we used the output to plot Fig. 4d.<br>
If you have any questions, please do not hesitate to contact Lingyun Chen: lychen83@qq.com or lychen@cpu.edu.cn.
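For readers who prefer a single script, the TPM handling in steps 13.7 to 13.10 can be sketched in Python as below. This is an illustrative sketch, not the authors' code. It assumes the standard tab-separated quant.sf layout with a header row (Name, Length, EffectiveLength, TPM, NumReads), and all function names and sequence IDs are hypothetical. Unlike the grep loop, it matches sequence names exactly, so a name such as "sp@DN1c0g1i1" does not also select "sp@DN1c0g1i10".

```python
import csv
import io


def read_tpm(handle):
    """Return {sequence name: TPM} from a Salmon quant.sf-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def mean_tpm(replicates):
    """Average TPM per sequence across replicate tables (step 13.8).

    Assumes every replicate was quantified against the same index,
    so all tables contain the same sequence names.
    """
    names = replicates[0].keys()
    return {n: sum(rep[n] for rep in replicates) / len(replicates) for n in names}


def summarize(avg, wanted):
    """Return (mean, sum) of averaged TPM for the exact names in `wanted`
    (steps 13.9 and 13.10); names missing from `avg` are skipped."""
    values = [avg[n] for n in wanted if n in avg]
    if not values:
        return 0.0, 0.0
    total = sum(values)
    return total / len(values), total


def fake_quant(tpm_by_name):
    """Build a tiny quant.sf-like text for demonstration (hypothetical data)."""
    lines = ["Name\tLength\tEffectiveLength\tTPM\tNumReads"]
    for name, tpm in tpm_by_name.items():
        lines.append(f"{name}\t1000\t800.0\t{tpm}\t50")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three hypothetical replicates standing in for quant1.sf .. quant3.sf.
    tables = [
        {"sp@DN1c0g1i1": 10.0, "sp@DN1c0g1i10": 100.0, "sp@DN2c0g1i1": 4.0},
        {"sp@DN1c0g1i1": 20.0, "sp@DN1c0g1i10": 200.0, "sp@DN2c0g1i1": 6.0},
        {"sp@DN1c0g1i1": 30.0, "sp@DN1c0g1i10": 300.0, "sp@DN2c0g1i1": 8.0},
    ]
    replicates = [read_tpm(io.StringIO(fake_quant(t))) for t in tables]
    averaged = mean_tpm(replicates)
    # Sequence names that would be listed in "taxon.fa" for one gene.
    wanted = ["sp@DN1c0g1i1", "sp@DN2c0g1i1"]
    print(summarize(averaged, wanted))
```

With real data, each `io.StringIO(...)` would be replaced by an opened quant.sf file, and `wanted` by the names read from "taxon.fa".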
Provider:
figshare
Created:
2024-10-23