Dataset for: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis, Nature Communications, DOI: 10.1038/s41467-024-53943-6
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://figshare.com/articles/dataset/Datasets_for_manuscript_Phylotranscriptomics_reveals_the_phylogeny_of_Asparagales_and_the_evolution_of_allium_flavor_biosynthesis/25516204
Description:
Dataset for Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis (DOI: 10.1038/s41467-024-53943-6).
The file 'Source Data' is the 'Source data' in the paper. Differences between 'Source Data' and the remaining files (we call them 'other files'): 'Source Data' includes the data used for plotting figures and tables. 'Other files' include assembled transcripts, CDS and PEP, orthologous genes, the output of PhyloNet, etc. Overall, the 'other files' contain more data.
'Fig1_assembled_transcriptomes_196trinity_*.7z' contains the de novo assembled transcriptomes, which were used for phylogenomic analyses in this study. 'Fig.4d_assembled_transcriptome.7z' contains the dataset used for gene expression level analysis.
Please refer to Supplementary Data 1 and 8 for the full species names.
'Fig1_CDS_PEP_compress9.7z' includes the CDSs and PEPs for the 501 samples used for phylogenetic analyses of Asparagales. 'Fig1_CDS_PEP_compress9.7z' includes files generated from steps 1 to 6.
'Fig1_Supp_1_3_4_501taxa_857orthologsgenes.7z' includes sequences for the 857 orthologs obtained using DISCO, the RAxML tree for each ortholog, and the ASTRAL tree inferred from the 857 ortholog trees. 'Fig1_Supp_1_3_4_501taxa_857orthologsgenes.7z' was generated in step 7.
'Supp_1_3_4_501taxa_857concatenatedgenes.7z' includes a concatenated matrix of the 857 orthologs and the RAxML tree inferred from the matrix. 'Supp_1_3_4_501taxa_857concatenatedgenes' was generated in step 7.
'Fig3_Asparagales_biogeography.tre' and 'Asparagales_distribution_20240206.csv' are datasets used for biogeographic analyses.
'Fig4b_Supp_Fig1-16_CSO_biosynthesisi_pathway.zip' includes sequences of the genes in the CSO biosynthesis pathway and their RAxML trees.
'Supp_Fig5_PhyloNet.7z' includes input and output files of PhyloNet for the 18 clades of Asparagales. This dataset is for step 9.
'Supp_Fig6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' is the data matrix used for divergence time estimation. Six independent analyses were conducted by running this *.xml file in BEAST six times. The output of the six runs was then combined, and TreeAnnotator was used to summarize divergence times.
'Supp_Fig16_Gene_ALL_LFS_TreePL.zip' includes files for divergence time estimation of ALL and LFS.
'taxon_list' includes the species abbreviations (first column) and the full species names (second column).
'software_disco.zip' includes the revised version of software DISCO used to extract ortholog sequences from homolog trees and sequences.
'remove_species_gaps.py' is a custom script, which was used to remove gaps in sequences.
'boxplot with jittered points.r', a script used to plot the ages of genes ALL and LFS.
Steps 1 to 5 adopted scripts from Yang and Smith (2014) and followed their methods with some revisions: Yang, Y. & Smith, S.A. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31, 3081-3092 (2014). https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/.
Step 1: Read processing
Transcriptome reads were downloaded from NCBI SRA or generated in this study. Low-quality reads and organelle reads were filtered out.
For paired-end reads:
python2 filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Magnoliophyta both num_cores output_dir clean
For single-end reads:
python2 filter_fq.py taxonID_1.fq.gz Magnoliophyta both num_cores output_dir clean
Step 2: Transcriptome assembly using Trinity (https://github.com/trinityrnaseq/trinityrnaseq)
python2 trinity_wrapper.py taxonID_1.overep_filtered.fq.gz taxonID_2.overep_filtered.fq.gz taxonID num_cores max_memory_GB stranded output_dir
Step 3: Get the longest transcript in each gene from the Trinity assembly and translate transcripts to CDS and PEP sequences
Execute the script get_longest_isoform_seq_per_trinity_gene.pl to get the longest transcripts.
get_longest_isoform_seq_per_trinity_gene.pl taxonID.Trinity.fasta >taxonID.longest_transcripts.fa
Translate the longest transcripts within Trinity assembly to CDS and PEP sequences using TransDecoder.
We used PEP sequences of rice, onion, garlic, and Arabidopsis thaliana as references. These genomes were accessed from: https://phytozome.jgi.doe.gov/ (Oryza sativa v7.0), https://figshare.com/search?q=garlic(garlic, dataset posted on 2020-06-24), https://www.oniongenome.wur.nl/(Onion gene AA sequences v1.2), and Arabidopsis thaliana reference genome TAIR10.1 (https://www.ncbi.nlm.nih.gov/search/all/?term=GCF_000001735.4).
python2 transdecoder_wrapper.py taxonID.longest_transcripts.fa num_cores non-stranded output_dir
Step 4: Clustering and extracting homologs
4.1. Before the homology search, check that the sequence names are formatted correctly and that PEPs and CDSs have matching sequence names. Check that there are no duplicated names, that names contain no special characters other than digits, letters and "_", that all names follow the format taxonID@seqID, and that file names are the taxonID.
python2 check_names.py DIR_includes_CDS_PEP file_ending
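The checks listed in 4.1 can be illustrated with a short Python sketch. This is not the check_names.py script from the pipeline; the function name check_names and the regular expression are assumptions made for illustration, based only on the rules stated above (taxonID@seqID, letters, digits and "_", no duplicates, taxonID equal to the file name).

```python
import re

# Assumed pattern: taxonID@seqID, each part made of letters, digits and "_".
NAME_RE = re.compile(r"^[A-Za-z0-9_]+@[A-Za-z0-9_]+$")


def check_names(fasta_lines, taxon_id):
    """Return a list of (name, problem) tuples for the headers of one FASTA file."""
    problems = []
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        name = line[1:].strip()
        if not NAME_RE.match(name):
            problems.append((name, "bad format"))
        elif name.split("@")[0] != taxon_id:
            problems.append((name, "taxonID mismatch"))
        if name in seen:
            problems.append((name, "duplicate"))
        seen.add(name)
    return problems
```

An empty list means that the headers of that file satisfy the rules above. Running the same check on the CDS and PEP files of one taxon and comparing the two sets of names would cover the matching-names requirement.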
Reduce sequence redundancy.
cd-hit-est -i taxonID.fa.cds -o taxonID.cdhitest -c 0.99 -n 10 -r 0 -T num_cores
Combine CDS sequences of all 501 samples in this study into one file.
cat *.cdhitest >all.fa
4.2. Run an all-by-all BLASTN search for the CDSs of the 501 samples.
makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.fa
blastn -db all.fa -query all.fa -evalue 10 -num_threads 20 -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'
4.3. Extract homologs for all the 501 samples
python2 blast_to_mcl.py all.rawblast 0.25 >mcl_all_rawblast_out_nohup_out_0.25
mcl mcl_all_rawblast_out_nohup_out_0.25 --abc -te 80 -tf 'gq(5)' -I 1.5 -o hit-frac0.25_I1.5_e5
python2 write_fasta_files_from_mcl.py all.fa hit-frac0.25_I1.5_e5 minimal_taxa outDIR
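To make the role of the 0.25 hit-fraction cutoff more concrete, the sketch below converts tabular BLAST lines into "query subject score" triples of the kind that mcl reads with --abc. It is not a reimplementation of blast_to_mcl.py: the coverage definition (aligned range divided by sequence length, for both query and subject), the per-line handling of hits, and the value 180 used when the e-value is 0 are all assumptions made for illustration.

```python
import math


def blast_to_abc(lines, min_fraction=0.25):
    """Yield (query, subject, score) for hits that pass a coverage filter.

    Columns follow the -outfmt string used in step 4.2:
    qseqid qlen sseqid slen frames pident nident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 16:
            continue  # skip malformed lines
        query, qlen, subject, slen = f[0], int(f[1]), f[2], int(f[3])
        if query == subject:
            continue  # ignore self hits
        qstart, qend, sstart, send = (int(x) for x in f[10:14])
        # Assumed definition of hit fraction: aligned range / sequence length.
        q_cov = (abs(qend - qstart) + 1) / qlen
        s_cov = (abs(send - sstart) + 1) / slen
        if q_cov < min_fraction or s_cov < min_fraction:
            continue
        evalue = float(f[14])
        # Assumption: cap the score when BLAST reports an e-value of 0.
        score = 180.0 if evalue == 0.0 else -math.log10(evalue)
        yield query, subject, score
```

With scores expressed as -log10(e-value), the mcl option -tf 'gq(5)' would keep only edges with a score of at least 5, which appears to be what the "e5" in the output name refers to.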
Step 5: Build a maximum likelihood tree for each homolog group
python2 fasta_to_tree_pxclsq.py fasta_dir number_cores dna bootstrap(y)
Step 6: Extract ortholog groups
Instead of the 1to1, MI, RT, or MO methods in Yang and Smith (2014), we used a revised version of DISCO (https://github.com/JSdoubleL/DISCO) to infer orthologs. The revised DISCO is named “ca_disco” and is available on this page.
6.1. Convert the aligned homolog groups from fasta format to phylip format using phylpwrtr.py (this script was deposited on this website).
python /path_DISCO/phylpwrtr.py input.fa output.fa max_number_allowed_in_sequence_head
python ca_disco.py -i a_text_file_includes_all_homolog_trees -o output.phy -a a_text_file_includes_the_path_of_aligned_homologs -t a_list_file_includes_names_of_the_501_samples -m min_number_of_taxon_allowed_for_each_ortholog
a_text_file_includes_all_homolog_trees: each line of this file contains one homolog tree.
We set the minimum number of taxa to 350.
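The input file with one homolog tree per line can be assembled in several ways. The following is a minimal Python sketch, assuming each tree file in a directory holds a single Newick tree and that tip labels contain no spaces; the function name collect_trees and its arguments are hypothetical.

```python
from pathlib import Path


def collect_trees(tree_dir, suffix, out_file):
    """Write every Newick tree found in tree_dir to out_file, one per line."""
    count = 0
    with open(out_file, "w") as out:
        for path in sorted(Path(tree_dir).iterdir()):
            if not path.name.endswith(suffix):
                continue
            # Collapse line breaks so that each tree occupies exactly one line.
            newick = "".join(path.read_text().split())
            if newick:
                out.write(newick + "\n")
                count += 1
    return count
```

The return value is the number of trees written, which can be compared with the expected number of homolog groups.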
6.2. Extract sequences of orthologs.
Convert the phylip format of output.phy to fasta format (named “output.fasta”) using phylpwrtr.py.
The “output.fasta” file includes the concatenated sequences of the 857 extracted orthologs. When ca_disco.py is executed, it prints partition information on screen. Save this information as “partition.txt”.
Next, split the concatenated sequences of the 857 orthologs into 857 individual orthologs using AMAS (https://github.com/marekborowiec/AMAS).
python AMAS.py split -f fasta -d dna -i output.fasta -l partition.txt -u fasta
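For readers who want to see what the split does, here is a small Python sketch of cutting a concatenated alignment into per-ortholog alignments. It does not call AMAS. It assumes partition lines of the form "name = start-end" with 1-based inclusive coordinates, optionally preceded by a "DNA, " prefix; the exact layout of the partition information printed by ca_disco.py is not documented here, so treat that format as an assumption.

```python
def parse_partitions(lines):
    """Parse lines such as 'gene1 = 1-300' into (name, start, end) tuples."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, span = line.split("=")
        start, end = span.strip().split("-")
        # Drop an optional 'DNA, ' style prefix before the partition name.
        name = name.split(",")[-1].strip()
        parts.append((name, int(start), int(end)))
    return parts


def split_alignment(seqs, parts):
    """Slice each aligned sequence by 1-based inclusive coordinates."""
    return {
        name: {taxon: seq[start - 1:end] for taxon, seq in seqs.items()}
        for name, start, end in parts
    }
```

Here seqs is a dictionary that maps each taxon name to its concatenated aligned sequence, and the result maps each partition name to its own taxon-to-sequence dictionary.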
Remove the gaps in sequences of the 857 orthologs using our script remove_species_gaps.py.
python2 remove_species_gaps.py input_alignment.fasta output_alignment.fasta
Step 7: Build the species tree (Supplementary Fig. 1 and 3 and Fig. 1a) and the concatenated tree (Supplementary Fig. 4) of Asparagales
7.1. Build a species tree using ASTRAL.
python2 fasta_to_tree_pxclsq.py folder_include_the_857_ortholog_groups number_of_cores dna y
java -jar /path_to_ASTRAL/astral.5.7.8.jar -i a_text_file_includes_the_857_RAxML_tree -o ASTRAL_output.tree
7.2. Build a concatenated tree using RAxML.
Put the aligned sequences of the 857 orthologs (file names ending with aln-cln) into a folder.
python2 concatenate_matrices_phyx.py folder_include_the_857_aln-cln_file 100 30 output_concatenated_file
raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 12345 -# 100 -p 12345 -s name_of_the_concatenated_857_ortholog_sequences -m GTRCAT -n name_of_the_concatenated_857_ortholog_sequences
Change the species abbreviations to full names in phylogenetic trees.
python2 taxon_name_subst.py taxon_list name_of_phylogenetic_tree
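The substitution of abbreviations by full names can be sketched in Python as follows. This is an illustration and not the taxon_name_subst.py script: it assumes that 'taxon_list' is tab-separated, that tip labels are exactly the abbreviations, and that spaces in full names should become underscores to keep the Newick string valid. The function names are hypothetical.

```python
import re


def load_taxon_table(lines):
    """Map abbreviation -> full name from two-column, tab-separated lines."""
    table = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            # Assumption: underscores replace spaces in the full names.
            table[cols[0]] = cols[1].replace(" ", "_")
    return table


def rename_tips(newick, table):
    """Replace whole labels that appear in the table; leave the rest untouched."""
    def swap(match):
        label = match.group(0)
        return table.get(label, label)

    # A token is a run of characters that are not Newick punctuation.
    # Branch lengths and support values are tokens too, but they are
    # returned unchanged unless they happen to be keys of the table.
    return re.sub(r"[^(),:;\s]+", swap, newick)
```

Because only whole labels are replaced, an abbreviation that is a prefix of another label is not rewritten by accident.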
Step 8: Assess phylogenetic conflicts
8.1. Calculate the concordant/conflicting bipartitions and internode certainty all (ICA) using PhyParts (Supplementary Fig. 3).
First, root 857 ortholog trees using script root_trees_multiple_outgroups_MO.py accessed from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/scripts/
python2 root_trees_multiple_outgroups_MO.py DIR_includes_the_857_ortholog_trees tree_file_ending output_DIR a_list_includes_names_of_outgroups
Next, execute PhyParts.
java -Xmx40g -jar path_to_phyparts_folder/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d dir_includes_the_857_rooted_trees -m name_of_the_Asparagales_species_tree -o output_name -s 50 -v
8.2. Execute Quartet Sampling (https://github.com/FePhyFoFum/quartetsampling) by using the concatenated matrix of the 857 orthologs and the concatenated RAxML tree (Supplementary Fig. 4).
python3 /path_to_quartet_sampling/quartet_sampling.py --tree RAxML_tree_of_the_857_concatenated_orthologs --align phylip_format_of_the_857_concatenated_ortholog_sequences.phy --reps 100 --threads 60 --lnlike 2
We used species abbreviations instead of full species names during analyses. Execute the following command to replace abbreviations with full species names in phylogenetic trees.
python taxon_name_subst.py *.tre
Step 9: PhyloNet (Supplementary Fig. 5)
In Linux, type the following command lines to run PhyloNet_3.8.2.jar. The input file names end with .nex. Each clade has five independent runs. The output file names end with .output. For example:
nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet1.nex >asparagales_phylonet1.output
nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet2.nex >asparagales_phylonet2.output
...
“asparagales_phylonet1.output” and “asparagales_phylonet2.output” are the output files of PhyloNet.
Step 10: Divergence time estimation using BEAST2 (Supplementary Fig. 6)
Execute the file 'Supp_Fig6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' using the online BEAST2 in CIPRES (https://www.phylo.org/). We executed six independent runs. The output was combined. The first 10% of trees were discarded as burn-in, and the remaining trees were used to generate a summary tree with TreeAnnotator v.2.6.3.
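The burn-in and pooling logic can be shown with a few lines of Python. In practice this is done with the BEAST2 tools; the sketch only illustrates the arithmetic. It assumes that the 10% burn-in is removed from each run before pooling, which is one possible reading of the description above, and it treats each run as a plain list of sampled trees. The function names are hypothetical.

```python
def discard_burnin(samples, fraction=0.10):
    """Drop the first `fraction` of the samples from one MCMC run."""
    n_drop = int(len(samples) * fraction)
    return samples[n_drop:]


def combine_runs(runs, fraction=0.10):
    """Apply the burn-in to each run, then pool the remaining samples in order."""
    pooled = []
    for run in runs:
        pooled.extend(discard_burnin(run, fraction))
    return pooled
```

Removing the burn-in per run keeps the early, unconverged samples of every chain out of the pooled set, whereas trimming only the start of an already combined file would remove samples from the first run alone.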
Step 11: Biogeographic analysis (Fig. 3 and Supplementary Fig. 7)
Ancestral areas were reconstructed using BioGeoBEARS, which was implemented in RASP (https://github.com/sculab/RASP). 'Fig3_Asparagales_biogeography.tre' and 'Asparagales_distribution_20240206.csv' are datasets used for biogeographic analyses.
Step 12: The divergence time of ALL and LFS (Supplementary Fig. 16)
Execute treePL (https://github.com/blackrim/treePL) to get dated trees.
treePL ALL_TreePL_config.txt
treePL LFS_TreePL_config.txt
Extract the ages in the two trees separately, naming them as 'ALL_date_extracted_20240206.txt' and 'LFS_date_extracted_20240206.txt'. Then, plot the age distribution using the script 'boxplot with jittered points.r'.
Step 13: Expression level for 13 genes in allium flavor pathway (Fig. 4d and Supplementary Fig. 17)
13.1. Filter adapters and low-quality bases using fastp (https://github.com/OpenGene/fastp).
fastp -i taxonname_R1.fq.gz -o taxonname_R1.output.fq -I taxonname_R2.fq.gz -O taxonname_R2.output.fq
13.2. Assemble transcriptome using Trinity.
Trinity --seqType fq --samples_file a_text_file_includes_names_of_input_fastq_files --CPU 10 --max_memory 100G --output taxon_Trinity
13.3. Get the longest transcripts in each gene of Trinity assembly using the script get_longest_isoform_seq_per_trinity_gene.pl in Trinity package.
./path_to_Trinity/get_longest_isoform_seq_per_trinity_gene.pl taxon_Trinity.fasta >taxon.longest_cluster_transcripts.fa
13.4. Translate the longest transcripts to CDS and PEP files using TransDecoder (https://github.com/TransDecoder/TransDecoder).
python2 transdecoder_wrapper.py taxon.longest_cluster_transcripts.fa 5 non-stranded taxon output_dir_name
13.5. Build index using Salmon.
salmon index -t CDS_obtained_from_Transdecoder -i output_index_name
13.6. Map sequence reads to the Salmon index
salmon quant -i name_index_file_from_last_step -l A -1 taxon_reads.1.fq -2 taxon_reads.2.fq -o output_name
The Salmon output includes a file named “quant.sf”, which contains the expression level of each gene obtained from TransDecoder. The expression level (TPM) is in the 4th column of “quant.sf”.
13.7. Extract the TPM.
After running Salmon, each species has three quant.sf files, renamed as quant1.sf, quant2.sf, quant3.sf.
Extract the TPM for the three replicates of each species using SHELL commands. The 1st column of the quant.sf is the sequence name, the 4th column is the TPM.
paste <(awk '{print $1 " " $4}' quant1.sf) <(awk '{print $1 " " $4}' quant2.sf) <(awk '{print $1 " " $4}' quant3.sf) > taxonmerged_data.txt
13.8. Average the TPM of the three replicates and save it in the 7th column. The 1st column contains the gene names.
awk '{print $0, ($2+$4+$6)/3}' taxonmerged_data.txt > taxonmerged_data_with_average.txt
awk '{print $1, $7}' taxonmerged_data_with_average.txt > taxonextracted_columns.txt
while IFS= read -r line; do grep "$line" taxonextracted_columns.txt; done < taxon.fa > taxon.tpm
“taxon.fa” lists, for each species, the names of the gene copies belonging to each gene. For example, gene CAT in species “alltai” includes the copies “alltai@DN105953c0g1i1”, “alltai@DN6385c0g1i14”, “alltai@DN42476c0g1i6”, “alltai@DN51139c2g1i4”, and “alltai@DN51139c1g1i3”. We extracted the TPM of these five copies and saved it in “taxon.tpm”. We obtained “taxon.fa” by checking the phylogenetic trees of the 13 genes (Supplementary Figs. 11-21) in allium flavor biosynthesis.
13.9. Average the TPM across the copies of each gene in each species.
awk '{ sum += $2 } END { if (NR > 0) print sum / NR }' taxon.tpm
13.10. Sum the TPM across the copies of each gene in each species.
awk '{sum += $2} END {print sum}' taxon.tpm
Then, we used the output to plot Fig. 4d.
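The TPM handling in steps 13.7 to 13.10 can also be written as a short Python sketch, which may be easier to adapt than the shell commands. It is an illustration under stated assumptions: quant.sf is whitespace-separated with a header row starting with "Name", the three replicates contain the same sequence names, and the function names are hypothetical.

```python
def read_tpm(lines):
    """Return {name: TPM} from quant.sf lines (1st column name, 4th column TPM)."""
    tpm = {}
    for line in lines:
        cols = line.split()
        if len(cols) < 4 or cols[0] == "Name":  # skip short lines and the header
            continue
        tpm[cols[0]] = float(cols[3])
    return tpm


def mean_tpm(replicates):
    """Average the TPM of each sequence across replicates."""
    names = replicates[0].keys()
    return {n: sum(r[n] for r in replicates) / len(replicates) for n in names}


def summarize(avg, gene_ids):
    """Return (mean, total) of the averaged TPM for the listed IDs (exact match)."""
    values = [avg[g] for g in gene_ids if g in avg]
    if not values:
        return None, 0.0
    return sum(values) / len(values), sum(values)
```

One design difference from the grep loop is that the lookup here matches sequence names exactly. A substring search for a name ending in "i1" would also match names ending in "i14", so exact matching avoids picking up unintended copies.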
If you have any questions, please contact Lingyun Chen at lychen83@qq.com or lychen@cpu.edu.cn.
Created:
2024-04-01



