Data and code for: Host-use Drives Convergent Evolution in Clownfish
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12625956
Description:
This folder contains the following files:
data/Alignments_WithOutgroups.tar.gz: Contains the alignments of 10,720 genes, including the sequences of the outgroup Pomacentrus moluccensis. The gene IDs correspond to the IDs in the Amphiprion frenatus reference genome (Marcionetti et al., 2018; https://datadryad.org/stash/dataset/doi:10.5061/dryad.nv1sv). The position of each gene on the Amphiprion percula chromosomes is also reported. For information on the methods and sample names, please refer to the publication. These alignments were used to infer the species tree with ASTRAL-III. The alignments of the 20 genes selected with SortaDate and used for dating with BEAST are also included: chr04_g2455.t1.WithOutgroup.phy, chr05_g51486.t1.WithOutgroup.phy, chr05_g56452.t1.WithOutgroup.phy, chr08_g50086.t1.WithOutgroup.phy, chr09_g35092.t1.WithOutgroup.phy, chr09_g49030.t1.WithOutgroup.phy, chr10_g47484.t1.WithOutgroup.phy, chr11_g5494.t1.WithOutgroup.phy, chr11_g32313.t1.WithOutgroup.phy, chr12_g7961.t1.WithOutgroup.phy, chr12_g27572.t1.WithOutgroup.phy, chr12_g32580.t1.WithOutgroup.phy, chr13_g33152.t1.WithOutgroup.phy, chr15_g60485.t1.WithOutgroup.phy, chr16_g18013.t1.WithOutgroup.phy, chr17_g60288.t1.WithOutgroup.phy, chr22_g6154.t1.WithOutgroup.phy, chr22_g22141.t1.WithOutgroup.phy, chr22_g29206.t1.WithOutgroup.phy, chr23_g36756.t1.WithOutgroup.phy.
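The dating alignments named above can be pulled out of the archive without unpacking all 10,720 files. The following is a minimal sketch, not part of the deposited scripts. It assumes the archive stores each alignment under its listed file name (in any subfolder); the function names are hypothetical, and only three of the twenty names are spelled out.

```python
import tarfile
from pathlib import PurePosixPath

# Subset of the file names listed above; extend with the remaining ones.
DATING_ALIGNMENTS = {
    "chr04_g2455.t1.WithOutgroup.phy",
    "chr05_g51486.t1.WithOutgroup.phy",
    "chr05_g56452.t1.WithOutgroup.phy",
}


def pick_members(member_names, wanted=DATING_ALIGNMENTS):
    """Keep the archive member names whose base name is in `wanted`."""
    return [n for n in member_names if PurePosixPath(n).name in wanted]


def extract_dating_alignments(archive_path, out_dir, wanted=DATING_ALIGNMENTS):
    """Extract only the wanted alignments from the .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        keep = set(pick_members(tar.getnames(), wanted))
        members = [m for m in tar.getmembers() if m.isfile() and m.name in keep]
        tar.extractall(path=out_dir, members=members)
    return sorted(m.name for m in members)
```

A call such as extract_dating_alignments("data/Alignments_WithOutgroups.tar.gz", "dating_alignments/") would return the extracted member names; the output folder name is arbitrary.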
data/DatedTree.WithOutgroup.tree: BEAST2 output. The dated clownfish phylogenetic tree, with the outgroup Pomacentrus moluccensis used for rooting. The tree was obtained with BEAST2, using the 20 most informative genes. For each partition, we applied a GTR+G site model and an uncorrelated relaxed clock with a lognormal distribution. A secondary calibration point was used, setting a uniform prior from 10 to 18 MYA on the crown age of clownfishes. For more information on the methods, please refer to the publication.
data/Example.DatFile.evolver.tar.gz: Templates of the .dat files (MCcodonNSbranchsites.Shifts_to_Entacmaea.dat, MCcodonNSbranchsites.Shifts_to_Radianthus.dat) containing the information needed to simulate sequences with evolver. The two .dat files were used to simulate sequences under different selection scenarios (no positive selection, convergent positive selection, positive selection on "long" or "clade" branches only) during the shifts to Entacmaea or Radianthus hosts. The files were used with the scripts Create_DATFile_Evolved.Shift_to_Entacmaea.py and Create_DATFile_Evolved.Shift_to_Radianthus.py to generate .dat files for all the conditions, which were then used in evolver. For more information on the methods, please refer to the publication.
data/DatFiles.Evolver.tar.gz: The .dat files obtained for the different omega values and evolutionary scenarios, for shifts to Entacmaea and Radianthus hosts. The files were generated from the template files (Example.DatFile.evolver.tar.gz) with the scripts Create_DATFile_Evolved.Shift_to_Entacmaea.py and Create_DATFile_Evolved.Shift_to_Radianthus.py. The resulting .dat files are run in evolver:
evolverNSbranchsites 6 DAT_FILES.dat
to obtain the codon alignment files used for the power and false positive rate analyses. For more information, please refer to the publication.
data/Alignments_ProteinCodingGenes.tar.gz: Contains the alignments of the 18,390 protein-coding genes analysed in the study. The gene IDs correspond to the IDs in the Amphiprion frenatus reference genome (Marcionetti et al., 2018; https://datadryad.org/stash/dataset/doi:10.5061/dryad.nv1sv). The position of each gene on the Amphiprion percula chromosomes is also reported. For information on the methods and sample names, please refer to the publication. These alignments were used to test for convergent positive selection occurring during host shifts.
data/Example.ControlFiles.CodeML.tar.gz: Contains examples of the control files for the null model (no positive selection, H0), the alternative model (positive selection, H1), and the site model M1a (used to verify the correct optimization of the null model). The final control files for each gene and condition (shift to Entacmaea or Radianthus host) were generated with the script Create_CTLFile_CodeML.py.
data/LabelledTree.CodeML.tar.gz: Tree files used in codeml analyses, with shifts to Entacmaea labelled as foreground branches (ClownTree.Rooted.Label_Entacmaea.nwk, ClownTree.Rooted.Label_Entacmaea.NoLongBranches.nwk) and shifts to Radianthus labelled as foreground branches (ClownTree.Rooted.Label_Radianthus.nwk, ClownTree.Rooted.Label_Radianthus.NoLongBranches.nwk). The trees are provided with and without the 3 "long branches" species (A. ocellaris, A. percula, P. biaculeatus). An additional archive (Additional_labelled_trees_for_test_on_simulated_data.tar.gz) contains trees in which only specific species are kept and labelled. These trees were used for codeml analyses on simulated alignments, to investigate the false positives and power of the analyses. For more information, refer to the publication.
data/SimulatedData_BranchSite_Results.tar.gz: Contains the results of the branch-site model on the simulated data. Each file name reports the simulated scenario (simulated without positive selection: Simulated_NO_PS_Entacmaea / Simulated_NO_PS_Radianthus; simulated convergent positive selection: Simulated_PS_Entacmaea / Simulated_PS_Radianthus; simulated positive selection on long branches: Simulated_PS_LongBranches / Simulated_PS_Premnas; simulated positive selection on "clade" branches: Simulated_PS_AKA / Simulated_PS_Ephi), as well as the tested scenario (tested for positive selection: Tested_PS_Entacmaea / Tested_PS_Radianthus; or tested for positive selection on specific branches). Each file contains the name of the original file, the simulated scenario, the tested scenario, the simulated omega, the replicate number, the log-likelihood of each tested model (site model: M1a; null model without positive selection: H0; alternative model with positive selection: H1), and the p-value of the likelihood-ratio test (LRT_pvalue). For more information, refer to the publication.
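Power and false positive rate can both be summarised from the LRT_pvalue values as the fraction of replicates in which the test rejects the null model. The sketch below illustrates that arithmetic only; the 0.05 threshold is an assumption, the function name is hypothetical, and the p-values are made-up toy numbers, not results from the study.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates whose LRT p-value falls below alpha."""
    if not p_values:
        raise ValueError("no p-values supplied")
    return sum(p < alpha for p in p_values) / len(p_values)


# Toy p-values for illustration only (not taken from the result files).
simulated_no_ps = [0.80, 0.03, 0.51, 0.92]      # simulated without positive selection
simulated_with_ps = [0.001, 0.20, 0.004, 0.01]  # simulated with positive selection

# Rejections when no selection was simulated are false positives.
false_positive_rate = rejection_rate(simulated_no_ps)
# Rejections when selection was simulated measure power.
power = rejection_rate(simulated_with_ps)
```

In practice the replicates would be grouped by simulated scenario, tested scenario and simulated omega before computing each rate.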
data/EmpiricalData_BranchSite_Results.tar.gz: Contains the results of the branch-site model for the 18,390 protein-coding genes tested in the study, for shifts to Entacmaea (Results.BranchSiteModel.Shifts_To_Entacmaea.txt, Results.BranchSiteModel.Shifts_To_Entacmaea.NoLongBranches.txt) and shifts to Radianthus hosts (Results.BranchSiteModel.Shifts_To_Radianthus.txt, Results.BranchSiteModel.Shifts_To_Radianthus.NoLongBranches.txt). Each file contains the chromosome information for the analyzed gene, the name of the gene, the log-likelihood of the M1a model (site model, used to verify the correct optimization of the null model), the log-likelihoods of the null model (H0) and the alternative model (H1), and the p-value of the likelihood-ratio test (LRT_pvalue). These p-values were subsequently corrected for multiple testing. For more information, refer to the publication.
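For readers who want to recompute the test from the reported log-likelihoods, here is a hedged sketch of a likelihood-ratio test followed by a multiple-testing correction. Two assumptions are made that this description does not confirm: the statistic is compared to a chi-square distribution with 1 degree of freedom, and the correction is Benjamini-Hochberg. The publication states the procedure actually used.

```python
import math


def lrt_pvalue(lnl_h0, lnl_h1):
    """LRT p-value, assuming a chi-square distribution with 1 degree of freedom."""
    # Clamp at zero in case H1 converged slightly below H0.
    stat = max(0.0, 2.0 * (lnl_h1 - lnl_h0))
    # Survival function of chi-square(1): P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The closed form with erfc keeps the sketch free of third-party dependencies; it holds only for 1 degree of freedom.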
data/ASR_adult_host_4st.rds: Contains the results of the ancestral state reconstruction of reproductive host associations, stored as an R list() object. In the list, $joint is a tree with the joint ancestral states (the most likely ancestral reproductive host association at each node), $marginal is a tree with the likelihood of each state at each node, $simmap holds 100 stochastic maps of ancestral states along the branches of the tree, calculated from the marginal reconstruction, and $map is a map of ancestral states along the branches, estimated from the joint reconstruction.
data/Absolute_host_assoc.tar.gz: Contains a description of the sources used to characterize the host associations of each clownfish species. For each species, we provide a list of the pictures used from public citizen science databases, with the associated URLs, and additional published references where used. reprod_host.csv contains our final classification of reproductive host associations.
data/DEC.tar.gz: Contains the files used for the biogeographic reconstruction (areas_adjacency_clowns.txt, areas_clowns.txt, calibrated_tree.tre, distances.txt, geo_col.txt) and the results of the biogeographic reconstruction. geo_obj.rds is an R object containing the joint reconstruction of ancestral biogeographic states, formatted for use in phylogenetic comparative analyses. list_geo_obj.rds is a list of 100 similar objects generated from stochastic maps of ancestral biogeographic states.
data/phenotype.tar.gz: Contains the results of the phenotyping of clownfish individuals. Within each file, the first column is the name of the species identified from the picture. morph_pca.csv contains the results of the PCA performed on the Procrustes coordinates of clownfish individuals. morph_traits.csv contains trait values calculated from the Procrustes coordinates of clownfish individuals. colorRGB.csv contains the results of the PCA performed on the concatenated red, green and blue channels of each clownfish image. colorWOB.csv contains the results of the PCAs performed independently on the white, orange and black channels. Columns with names ending in W represent PCA axes generated from the white channel (O: orange channel, B: black channel).
scripts/Create_CTLFile_CodeML.py: Script used to generate the control files for codeml analyses (simulated or empirical data). The script needs: the path to the folder where the codon alignments are found (see Alignments_ProteinCodingGenes.tar.gz or the alignments created for simulations); the path where the output files will be written; the paths to the alignment, tree file and output as they will be written in the control file; and the output suffix.
python Create_CTLFile_CodeML.py PATH/to/Alignments/ PATH/to/out/CTL_files/ alignment_in_ctlFile tree_in_ctlFile path_to_output_in_ctlFile Output_Suffix
This produces the control files for the null model (H0, no positive selection) and the alternative model (H1, positive selection), which can be run with codeml:
codeml CONTROL_FILE.ctl
to obtain the results. This was done using the trees with shifts to Radianthus or Entacmaea labelled as foreground branches (see LabelledTree.CodeML.tar.gz), on both empirical and simulated data. For more information, refer to the publication.
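To make the three models concrete, the sketch below writes the kind of control file codeml expects for each one. It is an illustration, not the deposited Create_CTLFile_CodeML.py: the function name is hypothetical, and the shared settings (CodonFreq, kappa, cleandata, starting omega) are assumed values; the archived example control files hold the settings actually used. The model-defining lines follow standard PAML usage: model = 2 with NSsites = 2 for the branch-site test, with omega fixed to 1 under H0, and model = 0 with NSsites = 1 for M1a.

```python
def codeml_ctl(seqfile, treefile, outfile, model_name):
    """Return the text of a codeml control file for M1a, H0 or H1."""
    per_model = {
        "M1a": {"model": 0, "NSsites": 1, "fix_omega": 0},  # site model
        "H0": {"model": 2, "NSsites": 2, "fix_omega": 1},   # omega fixed to 1
        "H1": {"model": 2, "NSsites": 2, "fix_omega": 0},   # omega estimated
    }[model_name]
    lines = [
        f"seqfile = {seqfile}",
        f"treefile = {treefile}",
        f"outfile = {outfile}",
        "runmode = 0",      # user-supplied tree
        "seqtype = 1",      # codon sequences
        "CodonFreq = 2",    # assumed: F3x4 codon frequencies
        f"model = {per_model['model']}",
        f"NSsites = {per_model['NSsites']}",
        "fix_kappa = 0",    # assumed: kappa estimated
        "kappa = 2",        # assumed starting value
        f"fix_omega = {per_model['fix_omega']}",
        "omega = 1",        # fixed value under H0, starting value otherwise
        "cleandata = 0",    # assumed: keep sites with gaps
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file such as CONTROL_FILE.ctl gives an input for the codeml command shown above.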
scripts/Create_DATFile_Evolved.Shift_to_Entacmaea.py, scripts/Create_DATFile_Evolved.Shift_to_Radianthus.py: Scripts used to generate the .dat files for evolver simulations. The scripts need the template .dat files (provided in Example.DatFile.evolver.tar.gz) and the path where the resulting .dat files will be saved:
python Create_DATFile_Evolved.Shift_to_Entacmaea.py MCcodonNSbranchsites.Shifts_to_Entacmaea.dat Out_dat_files_Entacmaeae/
python Create_DATFile_Evolved.Shift_to_Radianthus.py MCcodonNSbranchsites.Shifts_to_Radianthus.dat Out_dat_files_Radianthus/
The .dat files obtained are then run in evolver to generate the simulated alignments:
evolverNSbranchsites 6 DAT_FILES.dat
The simulated alignments and trees were then used in codeml to evaluate the power and false positive rate of the positive selection analyses. The script Create_CTLFile_CodeML.py was used to create the control files, which were then run with codeml. For more information, refer to the publication.
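Running evolver over a whole folder of generated .dat files can be scripted. The following is a minimal sketch under stated assumptions: evolverNSbranchsites is on the PATH, option 6 is used as in the command above, and each run gets its own working directory as a precaution, in case evolver writes fixed-name output files. The function names and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_runs(dat_files, work_root, program="evolverNSbranchsites"):
    """Pair each .dat file with a working directory and an evolver command."""
    runs = []
    for dat in sorted(Path(d) for d in dat_files):
        workdir = Path(work_root) / dat.stem
        # Absolute path, because the command is run from inside workdir.
        runs.append((workdir, [program, "6", str(dat.resolve())]))
    return runs


def execute(runs):
    """Run every planned command in its own directory, stopping on failure."""
    for workdir, command in runs:
        workdir.mkdir(parents=True, exist_ok=True)
        subprocess.run(command, cwd=workdir, check=True)
```

For example, execute(plan_runs(Path("Out_dat_files_Entacmaeae").glob("*.dat"), "evolver_runs")) would process every .dat file written by the first script call above; "evolver_runs" is an arbitrary folder name.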
scripts/ASR_adult_host.R: Script used to perform the ancestral state reconstruction of reproductive host associations. It uses data/Absolute_host_assoc/reprod_host.csv and data/BEAST2.DatedTree.WithOutgroup.tree and outputs the data/ASR_adult_host_4st.rds file. It requires several R packages ("ape", "igraph", "mvMORPH", "scales", "sda", "TeachingDemos", "png", "corHMM", "phytools"), which can be installed with install.packages("package-name").
Rscript scripts/ASR_adult_host.R
scripts/BGB_fit.R: Script used to perform the ancestral state reconstruction of biogeographic regions. It uses the data contained in data/DEC.tar.gz and outputs data/geo_obj.rds and data/list_geo_obj.rds. It requires several R packages ("ape", "BioGeoBEARS", "GenSA", "FD", "snow", "parallel", "cladoRcpp", "rexpokit"), which can be installed with install.packages("package-name").
Rscript scripts/BGB_fit.R
scripts/PCM_fit.R: Script used to perform the phylogenetic comparative analyses. It uses the data contained in the data folder. The first part performs a multivariate phylogenetic ANOVA. The second part performs model testing and parameter estimation using multivariate and univariate datasets and the joint maps from the ancestral state reconstruction. The third part performs model testing and parameter estimation using multivariate and univariate datasets and 100 stochastic maps from the marginal ancestral state reconstructions. Results are saved into .rds files. It requires several R packages ("ape", "phytools", "mvMORPH", "RPANDA", "geiger", "OUwie"), which can be installed with install.packages("package-name").
Rscript scripts/PCM_fit.R
scripts/manova_var.R: Script used to estimate the uncertainty in the MANOVA that is due to intraspecific variation. It uses the data contained in the data folder. Results are saved into .rds files. It requires several R packages ("ape", "phytools", "mvMORPH", "RPANDA", "geiger", "OUwie"), which can be installed with install.packages("package-name").
Rscript scripts/manova_var.R
Created:
2024-12-02



