Datasets for manuscript: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis

Name: Datasets for manuscript: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis
Creator: figshare
Published: 2024-10-06 00:37:08
License: No description available

DataCite Commons; updated 2024-10-06; indexed 2024-11-05

Download link:

https://figshare.com/articles/dataset/Datasets_for_manuscript_Phylotranscriptomics_reveals_the_phylogeny_of_Asparagales_and_the_evolution_of_allium_flavor_biosynthesis/25516204/7

Description:

These are the datasets used in the manuscript "Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis". Major steps and scripts are also included.

'1_CDS_PEP.tar.gz' includes CDS and PEP sequences for all 501 samples used for phylogenetic analyses of Asparagales. The full species names are listed in Supplementary Data 1. This file was generated from steps 1 to 6.
'2_501taxa_857orthologsgene.zip' includes sequences for the 857 orthologs obtained using DISCO, the RAxML tree for each ortholog, and the ASTRAL tree inferred from the 857 ortholog trees. This file was generated in step 7.
'3_501taxa_857concatenatedgenes.zip' includes the RAxML tree inferred from the 857 concatenated orthologs. This file was generated in step 7.
'4_Asparagales_biogeography.tre' and '4_Asparagales_distribution_20240206.csv' are the datasets used for biogeographic analyses.
'5_PhyloNet' contains the input and output files of PhyloNet for the 18 clades of Asparagales. This dataset is for step 9.
'6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' is the data matrix used for divergence time estimation. Six independent analyses were conducted: this *.xml file was run with BEAST six times, the output of the six runs was combined, and TreeAnnotator was used to summarize divergence times.
'7_CSO_biosynthesis_pathway.zip' includes sequences of the genes in the CSOs biosynthesis pathway and their RAxML trees.
'8_Gene_ALL_LFS_TreePL.zip' includes files for divergence time estimation of ALL and LFS.
'taxon_list' includes two columns. The first column is the species abbreviation and the second is the full species name.
'Source Data' is the Source Data for this manuscript.

Steps 1 to 5 adopted scripts from Yang and Smith (2014) and followed its methods with some revisions: Yang, Y. & Smith, S.A.
Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31, 3081-3092 (2014). https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/

Step 1: Read processing
For paired-end reads:
    python2 filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Magnoliophyta both num_cores output_dir clean
For single-end reads:
    python2 filter_fq.py taxonID_1.fq.gz Magnoliophyta both num_cores output_dir clean

Step 2: Transcriptome assembly using Trinity (https://github.com/trinityrnaseq/trinityrnaseq)
    python2 trinity_wrapper.py taxonID_1.overep_filtered.fq.gz taxonID_2.overep_filtered.fq.gz taxonID num_cores max_memory_GB stranded output_dir

Step 3: Get the longest transcript in each gene from the Trinity assembly and translate transcripts to CDS and PEP sequences
Execute the script get_longest_isoform_seq_per_trinity_gene.pl to get the longest transcripts.
    get_longest_isoform_seq_per_trinity_gene.pl taxonID.Trinity.fasta >taxonID.longest_transcripts.fa
Translate the longest transcripts of the Trinity assembly to CDS and PEP sequences using TransDecoder. We used PEP sequences of rice, onion, garlic, and Arabidopsis thaliana as references. These genomes were accessed from: https://phytozome.jgi.doe.gov/ (Oryza sativa v7.0), https://figshare.com/search?q=garlic (garlic, dataset posted on 2020-06-24), https://www.oniongenome.wur.nl/ (Onion gene AA sequences v1.2), and the Arabidopsis thaliana reference genome TAIR10.1 (https://www.ncbi.nlm.nih.gov/search/all/?term=GCF_000001735.4).
    python2 transdecoder_wrapper.py taxonID.longest_transcripts.fa num_cores non-stranded output_dir

Step 4: Cluster sequences and extract homologs
4.1. Before the homology search, check that the sequence names are formatted correctly and that PEPs and CDSs have matching sequence names.
Check for duplicated names and for special characters other than digits, letters, and "_". All names must follow the format taxonID@seqID, and file names must be the taxonID.
    python2 check_names.py DIR_includes_CDS_PEP file_ending
Reduce sequence redundancy.
    cd-hit-est -i taxonID.fa.cds -o taxonID.cdhitest -c 0.99 -n 10 -r 0 -T num_cores
Combine the CDS sequences of all 501 samples in this study into one file.
    cat *.cdhitest >all.fa
4.2. Run all-by-all BLASTN for the CDSs of the 501 samples.
    makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.fa
    blastn -db all.fa -query all.fa -evalue 10 -num_threads 20 -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'
4.3. Extract homologs
    python2 blast_to_mcl.py all.rawblast 0.25 >mcl_all_rawblast_out_nohup_out_0.25
    mcl mcl_all_rawblast_out_nohup_out_0.25 --abc -te 80 -tf 'gq(5)' -I 1.5 -o hit-frac0.25_I1.5_e5
    python2 write_fasta_files_from_mcl.py all.fa hit-frac0.25_I1.5_e5 minimal_taxa outDIR

Step 5: Build a maximum likelihood tree for each homolog group
    python2 fasta_to_tree_pxclsq.py fasta_dir number_cores aa bootstrap(y)

Step 6: Extract ortholog groups
Instead of the 1to1, MI, RT, or MO methods in Yang and Smith (2014), we used a revised version of DISCO (https://github.com/JSdoubleL/DISCO) to infer orthologs. The revised DISCO script is named “ca_disco.py” and is available on this page.
6.1.
Convert the aligned homolog groups from fasta format to phylip format using phylpwrtr.py (this script was deposited on this website).
    python /path_DISCO/phylpwrtr.py input.fa output.fa max_number_allowed_in_sequence_head
    python ca_disco.py -i a_text_file_includes_all_homolog_trees -o output.phy -a a_text_file_includes_the_path_of_aligned_homologs -t a_list_file_includes_names_of_the_501_samples -m min_number_of_taxon_allowed_for_each_ortholog
a_text_file_includes_all_homolog_trees: each line of this file contains one homolog tree.
We set the minimum number of taxa to 350.
6.2. Extract sequences of orthologs.
Convert output.phy from phylip format to fasta format (named “output.fasta”) using phylpwrtr.py. “output.fasta” contains the concatenated sequences of the 857 extracted orthologs. When ca_disco.py is executed, the script prints partition information on screen. Save this information as “partition.txt”.
Next, split the concatenated sequences into the 857 individual orthologs using AMAS (https://github.com/marekborowiec/AMAS).
    python AMAS.py split -f fasta -d dna -i output.fasta -l partition.txt -u fasta
Remove the gaps in each of the 857 orthologs using our script remove_species_gaps.py.
    python2 remove_species_gaps.py input_alignment.fasta output_alignment.fasta

Step 7: Build the species tree and the concatenated tree of Asparagales
7.1. Build a species tree using ASTRAL.
    python2 fasta_to_tree_pxclsq.py folder_include_the_857_ortholog_groups number_of_cores dna y
    java -jar /path_to_ASTRAL/astral.5.7.8.jar -i a_text_file_includes_the_857_RAxML_tree -o ASTRAL_output.tree
7.2.
Build a concatenated tree using RAxML.
Put the aligned sequences of the 857 orthologs (file names ending in aln-cln) into a folder.
    python2 concatenate_matrices_phyx.py folder_include_the_857_aln-cln_file 100 30 output_concatenated_file
    raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 12345 -# 100 -p 12345 -s name_of_the_concatenated_857_ortholog_sequences -m GTRCAT -n name_of_the_concatenated_857_ortholog_sequences
Change the species abbreviations to full names in the phylogenetic trees.
    python2 taxon_name_subst.py taxon_list name_of_phylogenetic_tree

Step 8: Assess phylogenetic conflicts
8.1. Execute Quartet Sampling (https://github.com/FePhyFoFum/quartetsampling)
    python3 /path_to_quartet_sampling/quartet_sampling.py --tree RAxML_tree_of_the_857_concatenated_orthologs --align phylip_format_of_the_857_concatenated_ortholog_sequences.phy --reps 100 --threads 60 --lnlike 2
8.2. Calculate the concordant/conflicting bipartitions and internode certainty all (ICA) using PhyParts.
First, root the 857 ortholog trees using scripts accessed from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/scripts/
    python2 root_trees_multiple_outgroups_MO.py folder_includes_the_857_ortholog_trees tre output_name a_list_includes_names_of_outgroups
Next, execute PhyParts.
    java -Xmx40g -jar path_to_phyparts_folder/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d dir_includes_the_857_rooted_trees -m name_of_the_Asparagales_species_tree -o output_name -s 50 -v

Step 9: PhyloNet
In Linux, type the following command lines to run PhyloNet_3.8.2.jar. The input files end with .nex. Each clade has five independent runs. The output files end with .output. For example:
    nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet1.nex >asparagales_phylonet1.output
    nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet2.nex >asparagales_phylonet2.output
    ...
“asparagales_phylonet1.output” and “asparagales_phylonet2.output” are the outputs of PhyloNet.
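The repeated PhyloNet runs in step 9 could be scripted rather than typed one by one. The following is a minimal sketch, not part of the deposited scripts: the function names phylonet_jobs and run_jobs are hypothetical, and it assumes that all .nex input files for a clade sit in one directory and that each output should be written next to its input with the .output suffix, as in the example commands above.

```python
import subprocess
from pathlib import Path


def phylonet_jobs(nex_dir, jar_path):
    """Return one (argv, output_path) pair per .nex file found in nex_dir.

    Hypothetical helper: it mirrors the command pattern
    'java -jar PhyloNet_3.8.2.jar run.nex > run.output' from step 9.
    """
    jobs = []
    for nex in sorted(Path(nex_dir).glob("*.nex")):
        argv = ["java", "-jar", str(jar_path), str(nex)]
        jobs.append((argv, nex.with_suffix(".output")))
    return jobs


def run_jobs(jobs):
    """Run each job in turn, redirecting standard output to its .output file."""
    for argv, out_path in jobs:
        with open(out_path, "w") as out:
            subprocess.run(argv, stdout=out, check=True)


# Example call (paths are placeholders, as in the commands above):
# run_jobs(phylonet_jobs("5_PhyloNet", "/Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar"))
```

The jobs are run sequentially here for simplicity. Because the runs are independent, they could also be launched in parallel if enough cores and memory are available.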
Step 10: Expression level for 13 genes in the allium flavor pathway
10.1. Filter adapters and low-quality bases using fastp (https://github.com/OpenGene/fastp).
    fastp -i taxonname_R1.fq.gz -o taxonname_R1.output.fq -I taxonname_R2.fq.gz -O taxonname_R2.output.fq
10.2. Assemble the transcriptome using Trinity.
    Trinity --seqType fq --samples_file a_text_file_includes_names_of_input_fastq_files --CPU 10 --max_memory 100G --output taxon_Trinity
10.3. Get the longest transcript in each gene of the Trinity assembly using the script get_longest_isoform_seq_per_trinity_gene.pl in the Trinity package.
    ./path_to_Trinity/get_longest_isoform_seq_per_trinity_gene.pl taxon_Trinity.fasta >taxon.longest_cluster_transcripts.fa
10.4. Translate to CDS and PEP files using TransDecoder (https://github.com/TransDecoder/TransDecoder).
    python2 transdecoder_wrapper.py taxon.longest_cluster_transcripts.fa 5 non-stranded taxon output_dir_name
10.5. Build an index using Salmon.
    salmon index -t CDS_obtained_from_Transdecoder -i output_index_name
10.6. Map sequence reads to the Salmon index.
    salmon quant -i name_index_file_from_last_step -l A -1 taxon_reads.1.fq -2 taxon_reads.2.fq -o output_name
The Salmon output includes a file named “quant.sf”, which contains the expression level of each gene obtained from TransDecoder. The TPM is in the 4th column of “quant.sf”.
10.7. Extract the TPM.
After running Salmon, each species has three quant.sf files, named quant1.sf, quant2.sf, and quant3.sf. Extract the TPM for the three replicates of each species using shell commands. The 1st column of quant.sf is the sequence name and the 4th column is the TPM.
    paste <(awk '{print $1 " " $4}' quant1.sf) <(awk '{print $1 " " $4}' quant2.sf) <(awk '{print $1 " " $4}' quant3.sf) > taxonmerged_data.txt
10.8. Average the TPM of the three replicates and save it in the 7th column.
The 1st column is the gene name, while the 2nd, 4th, and 6th columns are the TPM values.
    awk '{print $0, ($2+$4+$6)/3}' taxonmerged_data.txt > taxonmerged_data_with_average.txt
    awk '{print $1, $7}' taxonmerged_data_with_average.txt > taxonextracted_columns.txt
    while IFS= read -r line; do grep "$line" taxonextracted_columns.txt; done < taxon.fa > taxon.tpm
“taxon.fa” lists the sequence names for one gene in one species. For example, the “taxon.fa” for gene CAT in species “alltai” includes the sequences “alltai@DN105953c0g1i1”, “alltai@DN6385c0g1i14”, “alltai@DN42476c0g1i6”, “alltai@DN51139c2g1i4”, and “alltai@DN51139c1g1i3”. We extracted the TPM of these five sequences and saved it in “taxon.tpm”. We obtained “taxon.fa” by checking the phylogenetic trees of the 13 genes (Supplementary Figs. 11–21) in allium flavor biosynthesis.
10.9. Average the TPM of the sequences for each gene in each species.
    awk '{ sum += $2 } END { if (NR > 0) print sum / NR }' taxon.tpm
10.10. Sum the TPM of the sequences for each gene in each species.
    awk '{sum += $2} END {print sum}' taxon.tpm
Then, we used the output to plot Fig. 4d.

If you have any questions, please do not hesitate to contact Lingyun Chen: lychen83@qq.com or lychen@cpu.edu.cn.
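Steps 10.7 to 10.10 can also be expressed in a single script. The following is a minimal sketch, not the authors' code: the function names are hypothetical, it assumes the standard tab-separated Salmon quant.sf layout with a header row containing the Name and TPM columns, and it assumes the three replicates were quantified against the same index so that they share sequence names. Unlike the grep loop above, it matches sequence names exactly rather than as substrings, so it would need adjusting if the names in quant.sf carry extra suffixes.

```python
import csv


def read_tpm(path):
    """Map sequence name -> TPM from one Salmon quant.sf file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row["TPM"]) for row in reader}


def average_replicates(quant_paths):
    """Average the TPM of each sequence across replicate quant.sf files (steps 10.7-10.8)."""
    tables = [read_tpm(path) for path in quant_paths]
    return {
        name: sum(table[name] for table in tables) / len(tables)
        for name in tables[0]
    }


def summarize(mean_tpm, sequence_names):
    """Return (mean, sum) of the averaged TPM over the listed sequences (steps 10.9-10.10).

    Names missing from mean_tpm are skipped (exact matching, an assumption of this sketch).
    """
    values = [mean_tpm[name] for name in sequence_names if name in mean_tpm]
    total = sum(values)
    mean = total / len(values) if values else 0.0
    return mean, total


# Example call (file names follow the text above; the name list is the content of taxon.fa):
# mean_tpm = average_replicates(["quant1.sf", "quant2.sf", "quant3.sf"])
# mean, total = summarize(mean_tpm, ["alltai@DN105953c0g1i1", "alltai@DN6385c0g1i14"])
```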

Provider:

figshare

Created:

2024-10-06