five

Estimating accurate gene trees in the presence of intra-locus recombination: A simulation study

收藏
Mendeley Data2024-04-13 更新2024-06-29 收录
下载链接:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.6q573n626
下载链接
链接失效反馈
官方服务:
资源简介:
Data: There are four compressed tar folders in this submission. Each folder contains a dataset. Datasets are divided into three separate folders Newick files of simulated trees, phylip files of simulated alignments, and tskit object files (output of msprime for simulated trees) The most used dataset is Oct_16_2020_100k_Compressed.tar.gz It represents simulated regions totaling in length 100,000 base pairs For only ARG methods we tested simulated regions of less than 100,000 namely a 10,000 base pair dataset: Oct_16_2020_10k_Compressed.tar.gz and a 1,000 base pair dataset: Oct_16_2020_1k_Compressed.tar.gz For only the fixed length approach (312,625,1250,2500,5000,10000) using FastTree we used a different dataset of total length 100,000 This dataset was subsampled for these fixed lengths. This dataset was not used to represent the fixed length points in the MDS plots, only in the heat maps in figure 3. June_2_2020_100k_Compressed.tar.gz Supplemental: Supplemental Materials.doc contains supplemental Figures Script Overview: Code used for simulation, tree estimation, and RF comparison are provided. Code was designed to work on another computer, but this has not been tested. As a result minor errors might need fixing. Requirements: Languages used are mainly python, bash for pipelines, and R for RF comparison. Bash shell scripts will only work in a Linux environment with the parallel command installed. R packages Required (plus dependencies): ape phangorn Python3 packages: msprime tskit tsinfer Rent+, Relate, and Seq-gen must be installed separately, and their paths must be specified in the scripts. Script Files: There are two bash scripts which are pipelines for Co-Estimation methods/Data generation and Maximum Likelihood methods. There is also a general plotting script which needs minor modifications to run on any given file. 1.) Simulated Data Generation and Co-Estimation methods Can be run with: ./ARG_DATA_PROGRAMS_CSV_Pipe.sh The script contains variables which need to be set: paths to programs (Seq-gen, Relate, RentPlus), directories (code, data), and simulation parameters. The pipeline runs subpipelines dealing with, Data generation, Relate, Tsinfer, RentPlus, and then organizing the comparison to simulated trees. A CSV file representing the heatmap results will be output in the Results/ directory. The specific scripts and their purpose are listed below: PipeLine_All.sh -Subpipeline used to generate simulated data testmsprime.py -Uses the python package msprime to simulate gene trees for various parameters for a certain number of parameters SeqGenAncestralNewick.py -Creates Newick files with random ancestral sequence for all simulated replicates createCommandLineToRunSeqGen.py -Creates a bash file with all commands to run seq on each replicate AddAncestralToPhylip.py -Adds the ancestral sequence to the simulated output alignments of Seq gen. RelatePipeline.sh -Subpipeline used to run the Co-estimation program Relate CreateRelateCmdLine.py -Create bash script with commands to run Relate for all simulated replicates Relate_Run.py -Runs Relate on all simulated replicates RelateRF.py -Collect Relate outputfiles and creats a bash script with commands to calculate RF distance between simulated and estimated trees RelateConverter_RF2.py -calculate RF distance between simulated and estimated trees. Outputs results to a file TsinferPipeLine.sh -Subpipeline used to run the Co-estimation program Tsinfer CreateTsinferCmdLine.py -Create bash script with commands to run Tsinfer for all simulated replicates TsinferConverter.py -Runs Tsinfer on all simulated replicates TsinferRF.py -Collect Tsinfer outputfiles and creats a bash script with commands to calculate RF distance between simulated and estimated trees TsinferConverter_RF2.py -calculate RF distance between simulated and estimated trees. Outputs results to a file RentPlusPipeLine.sh -Subpipeline used to run the Co-estimation program Rent+ CreateRentPlusCmdLine.py -Create bash script with commands to run Rent+ for all simulated replicates RentPlusConverterKZ.py -Runs Rent+ on all simulated replicates RentPlusSolutionPipeLine.sh -Subpipeline used to calculate RF distance for Rent+ KendallColijnPrep.py -Create subfiles setting up calculation of simulated and estimated trees for Rent+ RentPlus_RF.r -calculate RF distance between simulated and estimated trees. Outputs results to a file CreateResultsRentPlus.py -collect result of RF distances and output into a structured file for each replicate. ResultsToCSV.py -Collects output of RF distances for all replicates and converts the data into a single CSV file 2.) Maximum Likelihood Tree Estimation can be run with: ./FastTreeScript.sh The script contains variables which need to be set: paths to programs (FastTree), directories (code, data), and simulation parameters. A CSV file representing the heatmap results will be output in the Results/ directory. The specific scripts and their purpose are listed below: FastTreeWholeEstTrue.py -splits simulated alignments in three ways: simulated c-gens, estimated c-gens, and the whole alignment (fixed length 100,000). Runs FastTree on the subsampled alignments and calculates the RF distance between estimated and simulated trees using Rescale_RF_File.r Rescale_RF_File.r -Takes in two newickfiles and calculates RF distance between them FastTreeResultsWTE.py -Collects the RF results file names and put the names in one file ResultsToCSV_WTE.py -Converts the RF results files into a CSV file for plotting FastTreeScriptCmdLine.py -Creates a bash script that runs all other Fixed Lengths on simulated data (312,625,1250,2500,5000,10000) using FastTreeScript.py FastTreeScript.py -Runs FastTree on a given interval Fixed Length intervals IqtreeProcessFT_InputFile_RF.py -Calculates RF distances for Fixed Length estimated trees RF_File.r -Takes in two newickfiles and calculates RF distance between them ResultsToCSV2.py -Converts the RF results files into a CSV file for plotting 3.) A general Plotting script: PlotHeatMap.py Line 21 contains the lstFiles variable, which contains the CSV files to plot.
创建时间:
2023-06-28
二维码
社区交流群
二维码
科研交流群
商业服务