Datasets for Léonard et al. Was the last bacterial common ancestor a monoderm after all?
Source: DataCite Commons. Updated 2022-01-04; indexed 2024-07-28.
Download link:
https://figshare.com/articles/dataset/Datasets_for_L_onard_et_al_Was_the_last_bacterial_common_ancestor_a_monoderm_after_all_/14932386/1
Resource description:
# Léonard et al. 2021: Archive content for v2

## Overview

[see `README.md` for directory structure]

## Alignments

### DCW_Single_Genes

For each of the 17 genes of the *dcw* gene cluster, the alignment is available in both `.ali` format and `.phy` (PHYLIP) format. The difference between the two formats is that the `.phy` files are cleaned and that their sequence names have been shortened. Cleaning statistics can be found in the corresponding `a2p-stat` files, whereas `.idm` files remap the original names to the new shorter names. `.idm` files in the `modified_idm` directory can tag sequences with a letter indicating whether each one is encoded in the main cluster (M), in a sub-cluster (S) or by a singleton gene (A).

MCL groups corresponding to the *dcw* gene cluster genes are as follows:

- MCLdcw110095 MurA
- MCLdcw110144 FtsZ
- MCLdcw110164 DdlB
- MCLdcw110196 FtsI
- MCLdcw110216 MraW
- MCLdcw110253 MurE
- MCLdcw110276 FtsW
- MCLdcw110295 MraY
- MCLdcw110307 MurC
- MCLdcw110309 MurG
- MCLdcw110351 MurD
- MCLdcw110389 MurF
- MCLdcw110652 FtsA
- MCLdcw110718 MurB
- MCLdcw110780 MraZ
- MCLdcw113075 FtsQ
- MCLdcw113678 FtsL

### DCW_Supermatrix

This directory contains the supermatrix based on 15 genes of the *dcw* cluster. Because FtsQ and FtsL are difficult to identify with certainty, they were excluded from the supermatrix. The supermatrix is provided in `.phy` (PHYLIP) format before and after cleaning (in the `cleaned` sub-directory).
Cleaning statistics can be found in the `a2p-stat` file, whereas `.idm` files remap the original names to the new shorter names.

### OGs_117_main_tree

This directory contains the supermatrices based on the 117 single-copy orthologous groups of genes that are the most common in our selection of genomes:

- `misgen_14.ali` is the supermatrix for 101 species
- `misgen_14-fltrd.ali` is the supermatrix for 85 species
- `scafos.fasta` is the supermatrix `misgen_14-fltrd.ali` after further cleaning based on the results of Phylo-MCOA

### OM_single_genes

For each of the 16 genes related to the outer membrane, the alignment is available in both `.fasta` format and `.phy` (PHYLIP) format. Again, the difference between the two formats is that the `.phy` files are cleaned and that their sequence names have been shortened. Cleaning statistics can be found in the corresponding `a2p-stat` files, whereas `.idm` files remap the original names to the new shorter names.

## BayesTraits (BT)

For each cell-wall character, both raw results and summarized results are available in `csv` format:

- membrane: `BayesTraits_results_monoderm_membrane.csv` / `BayesTraits_resume_monoderm_membrane.csv`
- peptidoglycan: `BayesTraits_results_monoderm_peptidoglycan.csv` / `BayesTraits_resume_monoderm_peptidoglycan.csv`

Moreover, raw results of our attempt to force the LBCA to be a diderm are also provided in `BayesTraits_results_monoderm_membrane_weighted_rates.csv`.

## Outer_membrane (OM)

The detailed HMM search results used to create Figure 3b are in `OM_genes_presence - hmms.csv`. The 4 `.pdf` files in the `synteny_output` directory are the figures produced by our tool for visualising the synteny of OM-related genes in our selection of 85 bacteria:

- `LptABC_85_sorted.pdf`
- `LptFG_a_85_sorted.pdf`
- `LptFG_b_85_sorted.pdf`
- `Tol_Pal_system_85_sorted.pdf`

## ProCARs

The `raph-cluster_all-ftsQL.xlsx` file is a summary of the input given to ProCARs and
its output, which was then used to produce Figure 3a. The `misgen_14-fltrd-CATGTRG-1000-1-5000-CF_root-mono_-nums.png` file is our reference tree. Numbered nodes correspond to column 1 of the `.xlsx` file. The `synteny_85_dcw.pdf` file is the figure produced by our tool for visualising the synteny of the *dcw* gene cluster in our selection of 85 bacteria.

## Scripts

The `LBCA_pipeline.md` file contains the command lines used to launch the different scripts and files stored in the sub-directories (see details below). The `R_session_info.md` file contains the output of the `sessionInfo()` command in R for the main laptop used for light computations and the main HPC system used for heavy computations.

### bayestraits

The five `.sh` files are bash scripts used to launch BayesTraits using the main tree with the Terrabacteria rooting. There is one bash script for each of the five models used with BayesTraits. The `setup-bayestraits.pl` Perl script is used to convert a tree file from the `.tre` (Newick) format to the `.nex` (NEXUS) format.

### procars

This directory contains a number of data files and bash/Perl scripts:

- `bacteria.cls` determines the color used for each phylum
- `block_mcl.txt` maps the MCL id to the block id used by ProCARs
- `mcl_protname.txt` maps the MCL id to the protein name
- `setup-procars.pl` prepares the main tree to be used by ProCARs
- `procars-filter.pl` and `procars_help.pl` help the user to create the input file for ProCARs with gene position data
- `procars_jobs.sh` launches ProCARs for every node of the tree
- `procars-postpro_v4.pl` uses the output of ProCARs to create a `.xlsx` file summarizing the input and the output results in human-friendly form

### synteny

The two `.R` scripts are the synteny program and its configuration file.
The `synteny_GUI.R` script produced the `.pdf` files in the `Outer_membrane` directory.

### tqmd

The `.R` and `.pl` files make up the early version of ToRQuEMaDA used in this study, which at that time was split into several independent scripts. Briefly, `tabulate_kmer.pl` converts the `compseq` output into an easier-to-handle format for the clustering script, `tqmd_v1.R`. `stat_names.pl` prepares the result file with the quality score of each proteome for use by the `best_choice_v2.pl` script, which selects the best representative for each group produced by the clustering script.

## Trees

The `DCW_17_SG.pdf` and `LBCA_OM_16_SG.pdf` files are the concatenation of the different single-gene trees for the *dcw* gene cluster and for the OM-related genes, respectively.

The three other files are in `.tre` (Newick) format:

- `misgen_14-CATG-500-45_root.tre` is the preliminary tree using 101 species
- `misgen_14-fltrd-CATGTRG-1000-1-5000-all_root.tre` is the tree using 85 species and 6 MCMC chains
- `misgen_14-fltrd-CATGTRG-1000-1-5000-CF_root.tre` is the tree using 85 species and the 2 best MCMC chains
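
As a quick sanity check on the Newick files listed above, the number of tips in a tree can be compared with the stated species count (101 or 85). The following is a minimal sketch and not part of the archive: the function name `newick_tip_labels` and the toy tree are hypothetical, and the parser assumes unquoted labels that contain no parentheses, commas, colons or semicolons. It has not been run against the actual `.tre` files.

```python
import re


def newick_tip_labels(newick: str) -> list:
    """Return the tip labels of a Newick string, in order of appearance.

    Assumption: labels are unquoted and contain none of ( ) , : ;
    """
    # Drop bracketed comments, which some programs embed in Newick output.
    text = re.sub(r"\[[^\]]*\]", "", newick)
    # A tip label directly follows "(" or ","; internal-node labels and
    # support values follow ")" instead, so they are not matched here.
    pattern = re.compile(r"[(,]\s*([^\s(),:;][^(),:;]*)")
    return [m.group(1).strip() for m in pattern.finditer(text)]


# Toy tree (hypothetical), with support values on the internal nodes.
toy_tree = "((A:0.1,B:0.2)0.95:0.3,(C:0.1,D:0.4)1:0.2,E:0.5);"
print(len(newick_tip_labels(toy_tree)), "tips:", newick_tip_labels(toy_tree))
```

To apply it to one of the archive trees, read the file contents into a string (for example with `pathlib.Path(...).read_text()`) and compare `len(newick_tip_labels(text))` with the expected species count.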
Provider:
figshare
Created:
2021-12-01



