Datasets for manuscript: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis

DataCite Commons | Updated: 2024-10-06 | Indexed: 2024-11-05
Download link:
https://figshare.com/articles/dataset/Datasets_for_manuscript_Phylotranscriptomics_reveals_the_phylogeny_of_Asparagales_and_the_evolution_of_allium_flavor_biosynthesis/25516204/7
Description:
<b>These are the datasets used in the manuscript: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis. Major steps/scripts are also included.</b><br>
'1_CDS_PEP.tar.gz' includes the CDS and PEP sequences for all 501 samples used in the phylogenetic analyses of Asparagales. The full species names are listed in Supplementary Data 1. '1_CDS_PEP.tar.gz' was generated by steps 1 to 6.<br>
'2_501taxa_857orthologsgene.zip' includes the sequences of the 857 orthologs obtained using DISCO, the RAxML tree for each ortholog, and the ASTRAL tree inferred from the 857 ortholog trees. '2_501taxa_857orthologsgene.zip' was generated in step 7.<br>
'3_501taxa_857concatenatedgenes.zip' includes the RAxML tree inferred from the 857 concatenated orthologs. '3_501taxa_857concatenatedgenes.zip' was generated in step 7.<br>
'4_Asparagales_biogeography.tre' and '4_Asparagales_distribution_20240206.csv' are the datasets used for the biogeographic analyses.<br>
'5_PhyloNet' contains the PhyloNet input and output files for the 18 clades of Asparagales. This dataset is for step 9.<br>
'6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' is the data matrix used for divergence time estimation. Six independent analyses were conducted by running this *.xml file in BEAST six times. The output of the six runs was then combined, and TreeAnnotator was used to summarize the divergence times.<br>
'7_CSO_biosynthesis_pathway.zip' includes the sequences of the genes in the CSOs biosynthesis pathway and their RAxML trees.<br>
'8_Gene_ALL_LFS_TreePL.zip' includes the files for divergence time estimation of ALL and LFS.<br>
'taxon_list' has two columns. The first column is the species abbreviation and the second is the full species name.<br>
'Source Data' contains the Source Data for this manuscript.<br>
Steps 1 to 5 use scripts from Yang and Smith (2014) and follow their methods with some revisions: Yang, Y. &amp; Smith, S.A.
Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31, 3081-3092 (2014). https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/.<br>
<b>Step 1: Read processing</b><br>
For paired-end reads:<br>
<i>python2 filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Magnoliophyta both num_cores output_dir clean</i><br>
For single-end reads:<br>
<i>python2 filter_fq.py taxonID_1.fq.gz Magnoliophyta both num_cores output_dir clean</i><br>
<b>Step 2: Transcriptome assembly using Trinity (https://github.com/trinityrnaseq/trinityrnaseq)</b><br>
<i>python2 trinity_wrapper.py taxonID_1.overep_filtered.fq.gz taxonID_2.overep_filtered.fq.gz taxonID num_cores max_memory_GB stranded output_dir</i><br>
<b>Step 3: Get the longest transcript of each gene from the Trinity assembly and translate the transcripts to CDS and PEP sequences</b><br>
Execute the script <i>get_longest_isoform_seq_per_trinity_gene.pl</i> to get the longest transcripts.<br>
<i>get_longest_isoform_seq_per_trinity_gene.pl taxonID.Trinity.fasta &gt;taxonID.longest_transcripts.fa</i><br>
Translate the longest transcripts of the Trinity assembly to CDS and PEP sequences using TransDecoder. We used the PEP sequences of rice, onion, garlic, and Arabidopsis thaliana as references. These genomes were accessed from: https://phytozome.jgi.doe.gov/ (Oryza sativa v7.0), https://figshare.com/search?q=garlic (garlic, dataset posted on 2020-06-24), https://www.oniongenome.wur.nl/ (Onion gene AA sequences v1.2), and the Arabidopsis thaliana reference genome TAIR10.1 (https://www.ncbi.nlm.nih.gov/search/all/?term=GCF_000001735.4).<br>
<i>python2 transdecoder_wrapper.py taxonID.longest_transcripts.fa num_cores non-stranded output_dir</i><br>
<b>Step 4: Cluster sequences and extract homologs</b><br>
4.1. Before the homology search, check that the sequence names are formatted correctly and that the PEPs and CDSs have matching sequence names.
Check that there are no duplicated names and no special characters other than digits, letters, and "_", that all names follow the format taxonID@seqID, and that the file names are the taxonID.<br>
<i>python2 check_names.py DIR_includes_CDS_PEP file_ending</i><br>
Reduce sequence redundancy.<br>
<i>cd-hit-est -i taxonID.fa.cds -o taxonID.cdhitest -c 0.99 -n 10 -r 0 -T num_cores</i><br>
Combine the CDS sequences of all 501 samples in this study into one file.<br>
<i>cat *.cdhitest &gt;all.fa</i><br>
4.2. Run an all-by-all BLASTN for the CDSs of the 501 samples.<br>
<i>makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.fa</i><br>
<i>blastn -db all.fa -query all.fa -evalue 10 -num_threads 20 -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'</i><br>
4.3. Extract homologs.<br>
<i>python2 blast_to_mcl.py all.rawblast 0.25 &gt;mcl_all_rawblast_out_nohup_out_0.25</i><br>
<i>mcl mcl_all_rawblast_out_nohup_out_0.25 --abc -te 80 -tf 'gq(5)' -I 1.5 -o hit-frac0.25_I1.5_e5</i><br>
<i>python2 write_fasta_files_from_mcl.py all.fa hit-frac0.25_I1.5_e5 minimal_taxa outDIR</i><br>
<b>Step 5: Build a maximum likelihood tree for each homolog group</b><br>
<i>python2 fasta_to_tree_pxclsq.py fasta_dir number_cores aa bootstrap(y)</i><br>
<b>Step 6: Extract ortholog groups</b><br>
Instead of the 1to1, MI, RT, or MO methods of Yang and Smith (2014), we used a revised version of DISCO (https://github.com/JSdoubleL/DISCO) to infer orthologs. The revised DISCO script is named “<i>ca_disco.py</i>” and is available on this page.<br>
6.1. 
Convert the aligned homolog groups from fasta format to phylip format using <i>phylpwrtr.py</i> (this script is deposited on this page), then run the revised DISCO.<br>
<i>python /path_DISCO/phylpwrtr.py input.fa output.fa max_number_allowed_in_sequence_head</i><br>
<i>python ca_disco.py -i a_text_file_includes_all_homolog_trees -o output.phy -a a_text_file_includes_the_path_of_aligned_homologs -t a_list_file_includes_names_of_the_501_samples -m min_number_of_taxon_allowed_for_each_ortholog</i><br>
a_text_file_includes_all_homolog_trees: each line of this file contains one homolog tree.<br>
We set the minimum number of taxa to 350.<br>
6.2. Extract the sequences of the orthologs.<br>
Convert output.phy from phylip format to fasta format (named “output.fasta”) using <i>phylpwrtr.py</i>. The “output.fasta” file contains the concatenated sequences of the 857 extracted orthologs. When <i>ca_disco.py</i> runs, it prints the partition information on screen. Save this information as “partition.txt”.<br>
Next, split the concatenated sequences into the 857 individual orthologs using AMAS (https://github.com/marekborowiec/AMAS).<br>
<i>python AMAS.py split -f fasta -d dna -i output.fasta -l partition.txt -u fasta</i><br>
Remove the gaps in each of the 857 orthologs using our script <i>remove_species_gaps.py</i>.<br>
<i>python2 remove_species_gaps.py input_alignment.fasta output_alignment.fasta</i><br>
<b>Step 7: Build the species tree and the concatenated tree of Asparagales</b><br>
7.1. Build a species tree using ASTRAL.<br>
<i>python2 fasta_to_tree_pxclsq.py folder_include_the_857_ortholog_groups number_of_cores dna y</i><br>
<i>java -jar /path_to_ASTRAL/astral.5.7.8.jar -i a_text_file_includes_the_857_RAxML_tree -o ASTRAL_output.tree</i><br>
7.2. 
Build a concatenated tree using RAxML.<br>
Put the aligned sequences of the 857 orthologs (file names ending in aln-cln) into a folder.<br>
<i>python2 concatenate_matrices_phyx.py folder_include_the_857_aln-cln_file 100 30 output_concatenated_file</i><br>
<i>raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 12345 -# 100 -p 12345 -s name_of_the_concatenated_857_ortholog_sequences -m GTRCAT -n name_of_the_concatenated_857_ortholog_sequences</i><br>
Change the species abbreviations to full names in the phylogenetic trees.<br>
<i>python2 taxon_name_subst.py taxon_list name_of_phylogenetic_tree</i><br>
<b>Step 8: Assess phylogenetic conflicts</b><br>
8.1. Execute Quartet Sampling (https://github.com/FePhyFoFum/quartetsampling).<br>
<i>python3 /path_to_quartet_sampling/quartet_sampling.py --tree RAxML_tree_of_the_857_concatenated_orthologs --align phylip_format_of_the_857_concatenated_ortholog_sequences.phy --reps 100 --threads 60 --lnlike 2</i><br>
8.2. Calculate the concordant/conflicting bipartitions and internode certainty all (ICA) using PhyParts.<br>
First, root the 857 ortholog trees using scripts from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/scripts/<br>
<i>python2 root_trees_multiple_outgroups_MO.py folder_includes_the_857_ortholog_trees tre output_name a_list_includes_names_of_outgroups</i><br>
Next, execute PhyParts.<br>
<i>java -Xmx40g -jar path_to_phyparts_folder/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d dir_includes_the_857_rooted_trees -m name_of_the_Asparagales_species_tree -o output_name -s 50 -v</i><br>
<b>Step 9: PhyloNet</b><br>
On Linux, use the following command lines to run PhyloNet_3.8.2.jar. The input file names end in .nex. Each clade has five independent runs. The output file names end in .output.
For example:<br>
<i>nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet1.nex &gt;asparagales_phylonet1.output</i><br>
<i>nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet2.nex &gt;asparagales_phylonet2.output</i><br>
...<br>
“asparagales_phylonet1.output” and “asparagales_phylonet2.output” are the PhyloNet output files.<br>
<b>Step 10: Expression level for 13 genes in allium flavor pathway</b><br>
10.1. Filter adapters and low-quality bases using fastp (https://github.com/OpenGene/fastp).<br>
<i>fastp -i taxonname_R1.fq.gz -o taxonname_R1.output.fq -I taxonname_R2.fq.gz -O taxonname_R2.output.fq</i><br>
10.2. Assemble the transcriptome using Trinity.<br>
<i>Trinity --seqType fq --samples_file a_text_file_includes_names_of_input_fastq_files --CPU 10 --max_memory 100G --output taxon_Trinity</i><br>
10.3. Get the longest transcript of each gene in the Trinity assembly using the script get_longest_isoform_seq_per_trinity_gene.pl from the Trinity package.<br>
<i>./path_to_Trinity/get_longest_isoform_seq_per_trinity_gene.pl taxon_Trinity.fasta &gt;taxon.longest_cluster_transcripts.fa</i><br>
10.4. Translate to CDS and PEP files using TransDecoder (https://github.com/TransDecoder/TransDecoder).<br>
<i>python2 transdecoder_wrapper.py taxon.longest_cluster_transcripts.fa 5 non-stranded taxon output_dir_name</i><br>
10.5. Build an index using Salmon.<br>
<i>salmon index -t CDS_obtained_from_Transdecoder -i output_index_name</i><br>
10.6. Map the sequence reads to the Salmon index.<br>
<i>salmon quant -i name_index_file_from_last_step -l A -1 taxon_reads.1.fq -2 taxon_reads.2.fq -o output_name</i><br>
The Salmon output includes a file named “quant.sf”, which gives the expression level of each gene obtained from TransDecoder. The TPM is in the 4th column of “quant.sf”.<br>
10.7. Extract the TPM.<br>
After running Salmon, each species has three quant.sf files (one per replicate), named quant1.sf, quant2.sf, and quant3.sf. Extract the TPM for the three replicates of each species using shell commands.
The 1st column of quant.sf is the sequence name and the 4th column is the TPM.<br>
<i>paste &lt;(awk '{print $1 " " $4}' quant1.sf) &lt;(awk '{print $1 " " $4}' quant2.sf) &lt;(awk '{print $1 " " $4}' quant3.sf) &gt; taxonmerged_data.txt</i><br>
10.8. Average the TPM of the three replicates and save it in the 7th column. In the merged file, the 1st column is the gene name, and the 2nd, 4th, and 6th columns are the TPM values.<br>
<i>awk '{print $0, ($2+$4+$6)/3}' taxonmerged_data.txt &gt; taxonmerged_data_with_average.txt</i><br>
<i>awk '{print $1, $7}' taxonmerged_data_with_average.txt &gt; taxonextracted_columns.txt</i><br>
<i>while IFS= read -r line; do grep "$line" taxonextracted_columns.txt; done &lt; taxon.fa &gt; taxon.tpm</i><br>
“taxon.fa” lists the sequence names for one gene in one species. For example, the “taxon.fa” for gene CAT in species “alltai” includes the genes “alltai@DN105953c0g1i1”, “alltai@DN6385c0g1i14”, “alltai@DN42476c0g1i6”, “alltai@DN51139c2g1i4”, and “alltai@DN51139c1g1i3”. We extracted the TPM of the five genes and saved it in “taxon.tpm”. We obtained “taxon.fa” by checking the phylogenetic trees of the 13 genes (Supplementary Figs. 11–21) in allium flavor biosynthesis.<br>
10.9. Average the TPM over the sequences of each gene in each species.<br>
<i>awk '{ sum += $2 } END { if (NR &gt; 0) print sum / NR }' taxon.tpm</i><br>
10.10. Sum the TPM over the sequences of each gene in each species.<br>
<i>awk '{sum += $2} END {print sum}' taxon.tpm</i><br>
Then, we used the output to plot Fig. 4d.<br>
If you have any questions, please do not hesitate to contact Lingyun Chen: lychen83@qq.com or lychen@cpu.edu.cn.
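For readers who prefer a single script, steps 10.7 to 10.10 can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes each quant.sf is the standard tab-separated Salmon output with the header Name, Length, EffectiveLength, TPM, NumReads. Unlike the paste and grep commands, it matches sequences by exact name and skips the header row. The function names and the TPM values in the example are hypothetical.

```python
import csv
import io


def read_tpm(handle):
    """Map sequence name to TPM for one Salmon quant.sf file (tab-separated, with header)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def mean_tpm(replicates):
    """Average TPM across replicates for sequences present in every replicate (step 10.8)."""
    shared = set(replicates[0]).intersection(*replicates[1:])
    return {name: sum(rep[name] for rep in replicates) / len(replicates)
            for name in shared}


def summarize_gene(averaged, names):
    """Return (mean, sum) of the averaged TPM over the listed names (steps 10.9 and 10.10)."""
    values = [averaged[name] for name in names if name in averaged]
    if not values:
        return 0.0, 0.0
    total = sum(values)
    return total / len(values), total


def fake_quant_sf(tpm_by_name):
    """Build quant.sf-style text; Length, EffectiveLength, and NumReads are dummy values."""
    lines = ["Name\tLength\tEffectiveLength\tTPM\tNumReads"]
    for name, tpm in tpm_by_name.items():
        lines.append(f"{name}\t1000\t800.0\t{tpm}\t10")
    return "\n".join(lines) + "\n"


# Hypothetical TPM values for two of the CAT sequences named in step 10.8.
names = ["alltai@DN105953c0g1i1", "alltai@DN6385c0g1i14"]
replicate_tpms = [(10.0, 4.0), (20.0, 8.0), (30.0, 0.0)]
replicates = [read_tpm(io.StringIO(fake_quant_sf(dict(zip(names, tpms)))))
              for tpms in replicate_tpms]

averaged = mean_tpm(replicates)
gene_mean, gene_sum = summarize_gene(averaged, names)
print(gene_mean, gene_sum)
```

With real data, each in-memory text would be replaced by an opened quant1.sf, quant2.sf, or quant3.sf file, and `names` by the lines of “taxon.fa”.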
Provider:
figshare
Created:
2024-10-06