Datasets for Léonard et al. Was the last bacterial common ancestor a monoderm after all?

Mendeley Data. Updated 2024-06-29; indexed 2024-06-28.

Download link:

https://figshare.com/articles/dataset/Datasets_for_L_onard_et_al_Was_the_last_bacterial_common_ancestor_a_monoderm_after_all_/14932386

Resource description:

Léonard et al. 2021: Archive content for v3 (= v2 public)

Overview

... 27 directories, 320 files

Alignments

DCW_Single_Genes

For each of the 17 genes of the dcw gene cluster, the alignment is available in both .ali format and .phy (PHYLIP) format. The two formats differ in that the .phy files are cleaned and their sequence names have been shortened. Cleaning statistics can be found in the corresponding a2p-stat files, whereas .idm files map the original names to the new, shorter names. The .idm files in the modified_idm directory can additionally tag each sequence with a letter indicating whether it is encoded in the main cluster (M), in a sub-cluster (S) or by a singleton gene (A).

MCL groups corresponding to the dcw gene cluster genes are as follows:

MCLdcw110095 MurA
MCLdcw110144 FtsZ
MCLdcw110164 DdlB
MCLdcw110196 FtsI
MCLdcw110216 MraW
MCLdcw110253 MurE
MCLdcw110276 FtsW
MCLdcw110295 MraY
MCLdcw110307 MurC
MCLdcw110309 MurG
MCLdcw110351 MurD
MCLdcw110389 MurF
MCLdcw110652 FtsA
MCLdcw110718 MurB
MCLdcw110780 MraZ
MCLdcw113075 FtsQ
MCLdcw113678 FtsL

DCW_Supermatrix

This directory contains the supermatrix based on 15 genes of the dcw cluster. Because FtsQ and FtsL are difficult to identify with certainty, they were excluded from the supermatrix. The supermatrix is provided in .phy (PHYLIP) format both before and after cleaning (the cleaned version is in the cleaned sub-directory).
Cleaning statistics can be found in the a2p-stat file, whereas .idm files map the original names to the new, shorter names.

OGs_117_main_tree

This directory contains the supermatrices based on the 117 single-copy orthologous groups of genes that are the most common in our selection of genomes:

misgen_14.ali is the supermatrix for 101 species
misgen_14-fltrd.ali is the supermatrix for 85 species
scafos.fasta is the supermatrix misgen_14-fltrd.ali after further cleaning based on the results of Phylo-MCOA

OM_single_genes

For each of the 16 genes related to the outer membrane, the alignment is available in both .fasta format and .phy (PHYLIP) format. Again, the two formats differ in that the .phy files are cleaned and their sequence names have been shortened. Cleaning statistics can be found in the corresponding a2p-stat files, whereas .idm files map the original names to the new, shorter names.

BayesTraits (BT)

For each cell-wall character, both raw results and summarized results are available in csv format:

membrane: BayesTraits_results_monoderm_membrane.csv / BayesTraits_resume_monoderm_membrane.csv
peptidoglycan: BayesTraits_results_monoderm_peptidoglycan.csv / BayesTraits_resume_monoderm_peptidoglycan.csv

Moreover, raw results of our attempt to force the LBCA to be a diderm are provided in BayesTraits_results_monoderm_membrane_weighted_rates.csv.

Outer_membrane (OM)

The detailed HMM search results used to create Figure 3b are in OM_genes_presence-hmms.csv. Raw files can be found in the directories profiles, hmmer and ompapa. The .fasta files of the last directory (ompapa) are the unaligned versions of those in the Alignments/OM_single_genes directory (see above).
The 4 .pdf files in the synteny_output directory are the figures produced by our tool for visualising the synteny of OM-related genes in our selection of 85 bacteria:

LptABC_85_sorted.pdf
LptFG_a_85_sorted.pdf
LptFG_b_85_sorted.pdf
Tol_Pal_system_85_sorted.pdf

ProCARs

The raph-cluster_all-ftsQL.xlsx file is a summary of the input given to ProCARs and of its output, which was then used to produce Figure 3a. The misgen_14-fltrd-CATGTRG-1000-1-5000-CF_root-mono_-nums.png file is our reference tree. Numbered nodes correspond to column 1 of the .xlsx file. The synteny_85_dcw.pdf file is the figure produced by our tool for visualising the synteny of the dcw gene cluster in our selection of 85 bacteria.

Scripts

The LBCA_pipeline.md file contains the command lines used to launch the different scripts and files stored in the sub-directories (see details below). The R_session_info.md file contains the output of the sessionInfo() command in R for the main laptop used for light computations and for the main HPC system used for heavy computations.

bayestraits

The five .sh files are bash scripts used to launch BayesTraits on the main tree with the Terrabacteria rooting. There is one bash script for each of the five models used with BayesTraits.
The setup-bayestraits.pl perl script is used to convert a tree file from the .tre (Newick) format to the .nex (NEXUS) format.

procars

This directory contains a number of data files and bash/perl scripts:

bacteria.cls determines the color used for each phylum
block_mcl.txt maps the MCL id to the block id used by ProCARs
mcl_protname.txt maps the MCL id to the protein name
setup-procars.pl prepares the main tree to be used by ProCARs
procars-filter.pl and procars_help.pl help the user to create the input file for ProCARs with gene position data
procars_jobs.sh launches ProCARs for every node of the tree
procars-postpro_v4.pl uses the output of ProCARs to create a .xlsx file summarizing the input and the output results in human-friendly form

synteny

The two .R scripts are the synteny program and its configuration file. The synteny_GUI.R script produced the .pdf files in the Outer_membrane directory.

tqmd

The .R and .pl files are the early version of ToRQuEMaDA used in this study, which at that time was split into several independent scripts. Briefly, tabulate_kmer.pl converts the compseq output into an easier-to-handle format for the clustering script, tqmd_v1.R. stat_names.pl prepares the result file with the quality score of each proteome for use by the best_choice_v2.pl script, which selects the best representative for each group produced by the clustering script.

Trees

The directories in Trees follow the same structure as in Alignments. The DCW_17_SG.pdf and LBCA_OM_16_SG.pdf files are the concatenation of the different single-gene trees for the dcw gene cluster and for the OM-related genes, respectively. The raw .tre files are available in the corresponding directories (DCW_Single_Genes and OM_Single_Genes).
For the dcw genes, trees were computed under both the PROTGAMMALGF and C60 models, whereas the OM gene trees were only inferred using PROTGAMMALGF.

The other files are also in .tre (Newick) format and correspond to supermatrices:

misgen_14-CATG-500-45_root.tre is the preliminary tree using 101 species
misgen_14-fltrd-CATGTRG-1000-1-5000-all_root.tre is the tree using 85 species and 6 MCMC chains
misgen_14-fltrd-CATGTRG-1000-1-5000-CF_root.tre is the tree using 85 species and the 2 best MCMC chains
scafos_supermatrix_CATG-AB-1000-025.tre is the tree built from the dcw genes only
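
The description mentions that setup-bayestraits.pl converts a tree from the .tre (Newick) format to the .nex (NEXUS) format. For readers unfamiliar with that step, the following is a minimal Python sketch of what such a conversion can look like. It is not the authors' Perl script, whose contents are not shown here. The function name newick_to_nexus, the tree name tree1 and the choice to emit a TRANSLATE table are all assumptions made for illustration, and quoted taxon labels are not handled.

```python
import re

# A leaf label directly follows "(" or "," and contains no Newick
# punctuation or whitespace. Quoted labels are NOT handled (assumption).
LEAF = re.compile(r"([(,]\s*)([^():,;\s]+)")


def newick_to_nexus(newick: str, tree_name: str = "tree1") -> str:
    """Wrap one Newick tree in a minimal NEXUS TREES block (hypothetical helper)."""
    newick = newick.strip()
    if not newick.endswith(";"):
        raise ValueError("expected a Newick string terminated by ';'")

    # Number the leaves in order of first appearance.
    numbers = {}
    for _, label in LEAF.findall(newick):
        numbers.setdefault(label, str(len(numbers) + 1))

    # Replace each leaf label by its number; branch lengths and
    # internal-node labels (which follow ")") are left untouched.
    numbered = LEAF.sub(lambda m: m.group(1) + numbers[m.group(2)], newick)

    translate = ",\n".join(
        f"        {num} {label}" for label, num in numbers.items()
    )
    return (
        "#NEXUS\n"
        "BEGIN TREES;\n"
        "    TRANSLATE\n"
        f"{translate};\n"
        f"    TREE {tree_name} = {numbered}\n"
        "END;\n"
    )


if __name__ == "__main__":
    # Toy tree for illustration only; not taken from the archive.
    print(newick_to_nexus("((A:0.1,B:0.2):0.3,C:0.4);"))
```

To apply the sketch to one of the archive's .tre files, read the file contents into a string and pass it to the function. Whether the resulting file matches what the BayesTraits scripts in the archive expect should be checked against the output of setup-bayestraits.pl itself.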

Created:

2023-06-28