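Source data and datasets for the manuscript: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis, Nature Communications, DOI: 10.1038/s41467-024-53943-6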

Creator: figshare
Published: 2024-12-08 00:59:58
License: no description available

Source: DataCite Commons. Updated: 2024-12-08. Indexed: 2025-01-06.

Download link:

https://figshare.com/articles/dataset/Datasets_for_manuscript_Phylotranscriptomics_reveals_the_phylogeny_of_Asparagales_and_the_evolution_of_allium_flavor_biosynthesis/25516204/15

Description:

These are the source data and datasets used in the study: Phylotranscriptomics reveals the phylogeny of Asparagales and the evolution of allium flavor biosynthesis (DOI: 10.1038/s41467-024-53943-6). The main steps and scripts are also included.

The file 'Source Data' is the Source Data of this study. It differs from the remaining files (which we call "other files") as follows: 'Source Data' mainly includes the files used to generate the figures and tables in this study, whereas the "other files" include the assembled transcriptomes, CDS and PEP, ortholog sequences, the output of PhyloNet, etc. Overall, the "other files" are more comprehensive and larger.

'Fig1_assembled_transcriptomes_196trinity_*.7z': de novo transcriptomes assembled from reads generated in this study, used for phylogenetic analyses.

'Fig.4d_assembled_transcriptome.7z': de novo transcriptomes assembled from reads generated in this study, used for gene expression level analyses.

There are some errors in the de novo transcriptome assemblies. Please do not use the file 'Fig1_assembled_transcriptomes_196trinity_*.7z'. I will upload the updated version soon (Lingyun Chen, 9 November 2024).

For the full names of species, see Supplementary Data 1 and Data 8 in 'Supplementary Data 20241005.xlsx'.

'Fig1_CDS_PEP_compress9.7z' includes the CDS and PEP of all 501 samples used for the phylogenetic analyses of Asparagales. The full species names are listed in Supplementary Data 1. 'Fig1_CDS_PEP_compress9.7z' was generated in steps 1 to 6.

'Fig1_Supp_1_3_4_501taxa_857orthologsgenes.7z' includes sequences for the 857 orthologs obtained using DISCO, the RAxML tree for each ortholog, and the ASTRAL tree inferred from the 857 ortholog trees. 'Fig1_Supp_1_3_4_501taxa_857orthologsgenes.7z' was generated in step 7.

'Supp_1_3_4_501taxa_857concatenatedgenes.7z' includes a concatenated matrix of the 857 orthologs and the RAxML tree inferred from the matrix. 'Supp_1_3_4_501taxa_857concatenatedgenes' was generated in step 7.

'Fig3_Asparagales_biogeography.tre' and 'Asparagales_distribution_20240206.csv' are datasets used for biogeographic analyses.

'Fig4b_Supp_Fig1-16_CSO_biosynthesisi_pathway.zip' includes sequences of the genes in the CSOs biosynthesis pathway and their RAxML trees.

'Supp_Fig5_PhyloNet.7z' includes input and output files of PhyloNet for the 18 clades of Asparagales. This dataset is for step 9.

'Supp_Fig6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' is the data matrix used for divergence time estimation. Six independent analyses were conducted; specifically, this *.xml file was run with BEAST six times.
The output of the six runs was then combined, and TreeAnnotator was used to summarize the divergence times.

'Supp_Fig16_Gene_ALL_LFS_TreePL.zip' includes files for divergence time estimation of ALL and LFS.

'taxon_list' includes the species abbreviations (first column) and the full species names (second column).

'software_disco.zip' includes the revised version of the software DISCO used to extract ortholog sequences from homolog trees and sequences.

'remove_species_gaps.py' is a custom script that was used to remove gaps in sequences.

'boxplot with jittered points.r' is a script used to plot the ages of the genes ALL and LFS.

Steps 1 to 5 adopted scripts from Yang and Smith (2014) and followed their methods with some revisions: Yang, Y. & Smith, S.A. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31, 3081-3092 (2014). https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/

Step 1: Read processing

Transcriptome reads were downloaded from NCBI SRA or generated in this study.
Filter low-quality reads and organelle reads.

For paired-end reads:
    python2 filter_fq.py taxonID_1.fq.gz taxonID_2.fq.gz Magnoliophyta both num_cores output_dir clean

For single-end reads:
    python2 filter_fq.py taxonID_1.fq.gz Magnoliophyta both num_cores output_dir clean

Step 2: Transcriptome assembly using Trinity (https://github.com/trinityrnaseq/trinityrnaseq)

    python2 trinity_wrapper.py taxonID_1.overep_filtered.fq.gz taxonID_2.overep_filtered.fq.gz taxonID num_cores max_memory_GB stranded output_dir

Step 3: Get the longest transcript of each gene from the Trinity assembly and translate the transcripts to CDS and PEP sequences

Execute the script get_longest_isoform_seq_per_trinity_gene.pl to get the longest transcripts.
    get_longest_isoform_seq_per_trinity_gene.pl taxonID.Trinity.fasta >taxonID.longest_transcripts.fa

Translate the longest transcripts of the Trinity assembly to CDS and PEP sequences using TransDecoder. We used the PEP sequences of rice, onion, garlic, and Arabidopsis thaliana as references. These genomes were accessed from: https://phytozome.jgi.doe.gov/ (Oryza sativa v7.0), https://figshare.com/search?q=garlic (garlic, dataset posted on 2020-06-24), https://www.oniongenome.wur.nl/ (Onion gene AA sequences v1.2), and the Arabidopsis thaliana reference genome TAIR10.1 (https://www.ncbi.nlm.nih.gov/search/all/?term=GCF_000001735.4).
    python2 transdecoder_wrapper.py taxonID.longest_transcripts.fa num_cores non-stranded output_dir

Step 4: Clustering and homolog extraction

4.1. Before the homology search, check that the sequence names are formatted correctly and that the PEPs and CDSs have matching sequence names.
Check for duplicated names and for special characters other than digits, letters and "_". All names must follow the format taxonID@seqID, and each file name must be the taxonID.
    python2 check_names.py DIR_includes_CDS_PEP file_ending

Reduce sequence redundancy.
    cd-hit-est -i taxonID.fa.cds -o taxonID.cdhitest -c 0.99 -n 10 -r 0 -T num_cores

Combine the CDS sequences of all 501 samples in this study into one file.
    cat *.cdhitest >all.fa

4.2. Run all-by-all BLASTN for the CDSs of the 501 samples.
    makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.fa
    blastn -db all.fa -query all.fa -evalue 10 -num_threads 20 -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'

4.3. Extract homologs for all 501 samples.
    python2 blast_to_mcl.py all.rawblast 0.25 >mcl_all_rawblast_out_nohup_out_0.25
    mcl mcl_all_rawblast_out_nohup_out_0.25 --abc -te 80 -tf 'gq(5)' -I 1.5 -o hit-frac0.25_I1.5_e5
    python2 write_fasta_files_from_mcl.py all.fa hit-frac0.25_I1.5_e5 minimal_taxa outDIR

Step 5: Build a maximum likelihood tree for each homolog group

    python2 fasta_to_tree_pxclsq.py fasta_dir number_cores dna bootstrap(y)

Step 6: Extract ortholog groups

Instead of the 1to1, MI, RT, or MO methods in Yang and Smith (2014), we used a revised version of DISCO (https://github.com/JSdoubleL/DISCO) to infer orthologs. The revised DISCO is named "ca_disco" and is available on this page.

6.1. Convert the aligned homolog groups from fasta format to phylip format using phylpwrtr.py (this script was deposited on this website).
    python /path_DISCO/phylpwrtr.py input.fa output.fa max_number_allowed_in_sequence_head
    python ca_disco.py -i a_text_file_includes_all_homolog_trees -o output.phy -a a_text_file_includes_the_path_of_aligned_homologs -t a_list_file_includes_names_of_the_501_samples -m min_number_of_taxon_allowed_for_each_ortholog

a_text_file_includes_all_homolog_trees: each line of this file includes one homolog tree.
We set the minimum number of taxa to 350.

6.2. Extract the sequences of the orthologs.

Convert output.phy from phylip format to fasta format (named "output.fasta") using phylpwrtr.py. The "output.fasta" includes the concatenated sequences of the 857 extracted orthologs. When ca_disco.py is executed, the script prints partition information on screen. Save this information as "partition.txt".

Next, split the concatenated sequences into the 857 individual orthologs using AMAS (https://github.com/marekborowiec/AMAS).
    python AMAS.py split -f fasta -d dna -i output.fasta -l partition.txt -u fasta

Remove the gaps in the sequences of the 857 orthologs using our script remove_species_gaps.py.
    python2 remove_species_gaps.py input_alignment.fasta output_alignment.fasta

Step 7: Build the species tree (Supplementary Fig. 1 and 3 and Fig. 1a) and the concatenated tree (Supplementary Fig. 4) of Asparagales

7.1. Build a species tree using ASTRAL.
    python2 fasta_to_tree_pxclsq.py folder_include_the_857_ortholog_groups number_of_cores dna y
    java -jar /path_to_ASTRAL/astral.5.7.8.jar -i a_text_file_includes_the_857_RAxML_tree -o ASTRAL_output.tree

7.2. Build a concatenated tree using RAxML.

Put the aligned sequences of the 857 orthologs (files ending with aln-cln) into a folder.
    python2 concatenate_matrices_phyx.py folder_include_the_857_aln-cln_file 100 30 output_concatenated_file
    raxmlHPC-PTHREADS-SSE3 -T 10 -f a -x 12345 -# 100 -p 12345 -s name_of_the_concatenated_857_ortholog_sequences -m GTRCAT -n name_of_the_concatenated_857_ortholog_sequences

Change the species abbreviations to full names in the phylogenetic trees.
    python2 taxon_name_subst.py taxon_list name_of_phylogenetic_tree

Step 8: Assess phylogenetic conflicts

8.1. Calculate the concordant/conflicting bipartitions and internode certainty all (ICA) using PhyParts (Supplementary Fig. 3).

First, root the 857 ortholog trees using the script root_trees_multiple_outgroups_MO.py, accessed from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/scripts/
    python2 root_trees_multiple_outgroups_MO.py DIR_includes_the_857_ortholog_trees tree_file_ending output_DIR a_list_includes_names_of_outgroups

Next, execute PhyParts.
    java -Xmx40g -jar path_to_phyparts_folder/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d dir_includes_the_857_rooted_trees -m name_of_the_Asparagales_species_tree -o output_name -s 50 -v

8.2. Execute Quartet Sampling (https://github.com/FePhyFoFum/quartetsampling) using the concatenated matrix of the 857 orthologs and the concatenated RAxML tree (Supplementary Fig. 4).
    python3 /path_to_quartet_sampling/quartet_sampling.py --tree RAxML_tree_of_the_857_concatenated_orthologs --align phylip_format_of_the_857_concatenated_ortholog_sequences.phy --reps 100 --threads 60 --lnlike 2

We used species abbreviations instead of full species names during the analyses. Execute the following command to replace the abbreviations with full species names in the phylogenetic trees.
    python taxon_name_subst.py *.tre

Step 9: PhyloNet (Supplementary Fig. 5)

In Linux, type the following command line to run PhyloNet_3.8.2.jar. The input files end with .nex.
Each clade has five independent runs. The output files end with .output. For example:
    nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet1.nex >asparagales_phylonet1.output
    nohup java -jar /Path_to_Dir_PhyloNet/PhyloNet_3.8.2.jar asparagales_phylonet2.nex >asparagales_phylonet2.output
    ...
"asparagales_phylonet1.output" and "asparagales_phylonet2.output" are the outputs of PhyloNet.

Step 10: Divergence time estimation using BEAST2 (Supplementary Fig. 6)

Execute the file 'Supp_Fig6_Asparagales_15sortadategenes_7calibrationpoints_BEASTi_run1.xml' using the online BEAST2 in CIPRES (https://www.phylo.org/). We executed six independent runs and combined the output. The first 10% of trees were discarded as burn-in, and the remaining trees were used to generate a summary tree with TreeAnnotator v.2.6.3.

Step 11: Biogeographic analysis (Fig. 3 and Supplementary Fig. 7)

Ancestral areas were reconstructed using BioGeoBEARS, as implemented in RASP (https://github.com/sculab/RASP). 'Fig3_Asparagales_biogeography.tre' and 'Asparagales_distribution_20240206.csv' are the datasets used for the biogeographic analyses.

Step 12: The divergence time of ALL and LFS (Supplementary Fig. 16)

Execute treePL (https://github.com/blackrim/treePL) to get dated trees.
    treePL ALL_TreePL_config.txt
    treePL LFS_TreePL_config.txt

Extract the ages from the two trees separately and name the files 'ALL_date_extracted_20240206.txt' and 'LFS_date_extracted_20240206.txt'. Then plot the age distribution using the script 'boxplot with jittered points.r'.

Step 13: Expression level of 13 genes in the allium flavor pathway (Fig. 4d and Supplementary Fig. 17)

13.1. Filter adapters and low-quality bases using fastp (https://github.com/OpenGene/fastp).
    fastp -i taxonname_R1.fq.gz -o taxonname_R1.output.fq -I taxonname_R2.fq.gz -O taxonname_R2.output.fq

13.2. Assemble the transcriptome using Trinity.
    Trinity --seqType fq --samples_file a_text_file_includes_names_of_input_fastq_files --CPU 10 --max_memory 100G --output taxon_Trinity

13.3. Get the longest transcript of each gene in the Trinity assembly using the script get_longest_isoform_seq_per_trinity_gene.pl in the Trinity package.
    ./path_to_Trinity/get_longest_isoform_seq_per_trinity_gene.pl taxon_Trinity.fasta >taxon.longest_cluster_transcripts.fa

13.4. Translate the longest transcripts to CDS and PEP files using TransDecoder (https://github.com/TransDecoder/TransDecoder).
    python2 transdecoder_wrapper.py taxon.longest_cluster_transcripts.fa 5 non-stranded taxon output_dir_name

13.5. Build an index using Salmon.
    salmon index -t CDS_obtained_from_Transdecoder -i output_index_name

13.6. Map the sequence reads to the Salmon index.
    salmon quant -i name_index_file_from_last_step -l A -1 taxon_reads.1.fq -2 taxon_reads.2.fq -o output_name

The Salmon output includes a file named "quant.sf", which gives the expression level of each gene obtained from TransDecoder. The expression value (TPM) is in the 4th column of "quant.sf".

13.7. Extract the TPM.

After running Salmon, each species has three quant.sf files, named quant1.sf, quant2.sf, and quant3.sf. Use shell commands to extract the TPM of the three replicates of each species. Column 1 of quant.sf is the sequence name and column 4 is the TPM.
    paste <(awk '{print $1 " " $4}' quant1.sf) <(awk '{print $1 " " $4}' quant2.sf) <(awk '{print $1 " " $4}' quant3.sf) > taxonmerged_data.txt

13.8. Average the TPM of the three replicates and save it in column 7. Column 1 is the gene name, while columns 2, 4, and 6 are the TPM values.
    awk '{print $0, ($2+$4+$6)/3}' taxonmerged_data.txt > taxonmerged_data_with_average.txt
    awk '{print $1, $7}' taxonmerged_data_with_average.txt > taxonextracted_columns.txt
    while IFS= read -r line; do grep "$line" taxonextracted_columns.txt; done < taxon.fa > taxon.tpm

"taxon.fa" lists the names of the copies of each gene in each species. For example, the "taxon.fa" for the gene CAT in the species "alltai" includes the genes "alltai@DN105953c0g1i1", "alltai@DN6385c0g1i14", "alltai@DN42476c0g1i6", "alltai@DN51139c2g1i4", and "alltai@DN51139c1g1i3". We extracted the TPM of these 5 genes and saved it in "taxon.tpm". We obtained "taxon.fa" by checking the phylogenetic trees of the 13 genes in allium flavor biosynthesis (Supplementary Figs. 11-21).

13.9. Average the TPM of each gene in each species.
    awk '{ sum += $2 } END { if (NR > 0) print sum / NR }' taxon.tpm

13.10. Sum the TPM of each gene in each species.
    awk '{sum += $2} END {print sum}' taxon.tpm

We then used the output to plot Fig. 4d.

If you have any questions, please feel free to contact Lingyun Chen: lychen83@qq.com or lychen@cpu.edu.cn.
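
For readers who prefer a script to the shell one-liners, steps 13.7 to 13.10 can be sketched in Python as follows. This is only an illustration and not the authors' code: the function names and the toy data are invented here, and it assumes each quant.sf is a tab-separated table with a header row containing the columns "Name" and "TPM". Unlike the grep command in step 13.8, which matches substrings, this sketch looks up sequence names by exact match.

```python
import csv
import io


def read_tpm(handle):
    """Return {sequence name: TPM} from a quant.sf-style table (assumed tab-separated with a header)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def mean_tpm_across_replicates(replicates):
    """Average the TPM of each sequence over the replicate tables (step 13.8)."""
    names = replicates[0].keys()
    return {n: sum(r[n] for r in replicates) / len(replicates) for n in names}


def summarize_gene(avg_tpm, copy_names):
    """Return (mean, sum) of the averaged TPM over the listed gene copies (steps 13.9 and 13.10)."""
    values = [avg_tpm[n] for n in copy_names]
    return sum(values) / len(values), sum(values)


# Toy data invented for illustration; real input would be quant1.sf, quant2.sf, quant3.sf.
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
toy = [
    HEADER + "sp@seqA\t900\t750.0\t10.0\t100\nsp@seqB\t600\t450.0\t4.0\t30\n",
    HEADER + "sp@seqA\t900\t750.0\t20.0\t210\nsp@seqB\t600\t450.0\t5.0\t40\n",
    HEADER + "sp@seqA\t900\t750.0\t30.0\t290\nsp@seqB\t600\t450.0\t6.0\t50\n",
]
replicates = [read_tpm(io.StringIO(text)) for text in toy]
avg = mean_tpm_across_replicates(replicates)

# The list of copy names plays the role of "taxon.fa" for one gene in one species.
gene_mean, gene_sum = summarize_gene(avg, ["sp@seqA", "sp@seqB"])
print(avg, gene_mean, gene_sum)
```

To use it on real files, replace the io.StringIO objects with opened quant.sf files. Exact-name lookup was chosen here because a substring search for a name ending in "i1" would also match a name ending in "i14"; whether that matters depends on the sequence names in the dataset.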

Provider:

figshare

Created:

2024-12-07