Datasets for Léonard et al. Was the last bacterial common ancestor a monoderm after all?
Source: DataCite Commons. Updated 2022-01-04; indexed 2024-07-28.
Download link:
https://figshare.com/articles/dataset/Datasets_for_L_onard_et_al_Was_the_last_bacterial_common_ancestor_a_monoderm_after_all_/14932386
Description:
Léonard et al. 2021: Archive content for v3 (= v2 public)

Overview

<pre><code>... 27 directories, 320 files </code></pre>

Alignments

DCW_Single_Genes

For each of the 17 genes of the <em>dcw</em> gene cluster, the alignment is available in both <code>.ali</code> and <code>.phy</code> (PHYLIP) formats. The two formats differ in that the <code>.phy</code> files have been cleaned and their sequence names shortened. Cleaning statistics can be found in the corresponding <code>a2p-stat</code> files, whereas <code>.idm</code> files remap the original names to the new shorter names. The <code>.idm</code> files in the <code>modified_idm</code> directory can tag each sequence with a letter indicating whether it is encoded in the main cluster (M), in a sub-cluster (S) or by a singleton gene (A).

MCL groups corresponding to the <em>dcw</em> gene cluster genes are as follows:

MCLdcw110095 MurA
MCLdcw110144 FtsZ
MCLdcw110164 DdlB
MCLdcw110196 FtsI
MCLdcw110216 MraW
MCLdcw110253 MurE
MCLdcw110276 FtsW
MCLdcw110295 MraY
MCLdcw110307 MurC
MCLdcw110309 MurG
MCLdcw110351 MurD
MCLdcw110389 MurF
MCLdcw110652 FtsA
MCLdcw110718 MurB
MCLdcw110780 MraZ
MCLdcw113075 FtsQ
MCLdcw113678 FtsL

DCW_Supermatrix

This directory contains the supermatrix based on 15 genes of the <em>dcw</em> cluster. Because FtsQ and FtsL are difficult to identify with certainty, they were excluded from the supermatrix. The supermatrix is provided in <code>.phy</code> (PHYLIP) format before and after cleaning (the cleaned version is in the <code>cleaned</code> sub-directory).
Cleaning statistics can be found in the <code>a2p-stat</code> file, whereas <code>.idm</code> files remap the original names to the new shorter names.

OGs_117_main_tree

This directory contains the supermatrices based on the 117 single-copy orthologous groups of genes that are the most common in our selection of genomes:

<code>misgen_14.ali</code> is the supermatrix for 101 species
<code>misgen_14-fltrd.ali</code> is the supermatrix for 85 species
<code>scafos.fasta</code> is the supermatrix <code>misgen_14-fltrd.ali</code> after further cleaning based on the results of Phylo-MCOA

OM_single_genes

For each of the 16 genes related to the outer membrane, the alignment is available in both <code>.fasta</code> and <code>.phy</code> (PHYLIP) formats. Again, the two formats differ in that the <code>.phy</code> files have been cleaned and their sequence names shortened. Cleaning statistics can be found in the corresponding <code>a2p-stat</code> files, whereas <code>.idm</code> files remap the original names to the new shorter names.

BayesTraits (BT)

For each cell-wall character, both raw and summarized results are available in <code>csv</code> format:

membrane: <code>BayesTraits_results_monoderm_membrane.csv</code> / <code>BayesTraits_resume_monoderm_membrane.csv</code>
peptidoglycan: <code>BayesTraits_results_monoderm_peptidoglycan.csv</code> / <code>BayesTraits_resume_monoderm_peptidoglycan.csv</code>

Moreover, raw results of our attempt to force the LBCA to be a diderm are provided in <code>BayesTraits_results_monoderm_membrane_weighted_rates.csv</code>.

Outer_membrane (OM)

The detailed HMM search results used to create Figure 3b are in <code>OM_genes_presence-hmms.csv</code>. Raw files can be found in the directories <code>profiles</code>, <code>hmmer</code> and <code>ompapa</code>.
The <code>.fasta</code> files of the latter directory are the unaligned versions of those in the <code>Alignments/OM_single_genes</code> directory (see above). The 4 <code>.pdf</code> files in the <code>synteny_output</code> directory are the figures produced by our tool for visualising the synteny of OM-related genes in our selection of 85 bacteria:

<code>LptABC_85_sorted.pdf</code>
<code>LptFG_a_85_sorted.pdf</code>
<code>LptFG_b_85_sorted.pdf</code>
<code>Tol_Pal_system_85_sorted.pdf</code>

ProCARs

The <code>raph-cluster_all-ftsQL.xlsx</code> file is a summary of the input given to ProCARs and its output, which was then used to produce Figure 3a. The <code>misgen_14-fltrd-CATGTRG-1000-1-5000-CF_root-mono_-nums.png</code> file is our reference tree. Numbered nodes correspond to column 1 of the <code>.xlsx</code> file. The <code>synteny_85_dcw.pdf</code> file is the figure produced by our tool for visualising the synteny of the <em>dcw</em> gene cluster in our selection of 85 bacteria.

Scripts

The <code>LBCA_pipeline.md</code> file contains the command lines used to launch the different scripts and files stored in the sub-directories (see details below). The <code>R_session_info.md</code> file contains the output of the <code>sessionInfo()</code> command in R for the main laptop used for light computations and the main HPC system used for heavy computations.

bayestraits

The five <code>.sh</code> files are bash scripts used to launch BayesTraits using the main tree with the Terrabacteria rooting. There is one bash script for each of the five models used with BayesTraits.
The <code>setup-bayestraits.pl</code> Perl script is used to convert a tree file from the <code>.tre</code> (Newick) format to the <code>.nex</code> (NEXUS) format.

procars

This directory contains a number of data files and bash/Perl scripts:

<code>bacteria.cls</code> determines the color used for each phylum
<code>block_mcl.txt</code> maps the MCL id to the block id used by ProCARs
<code>mcl_protname.txt</code> maps the MCL id to the protein name
<code>setup-procars.pl</code> prepares the main tree to be used by ProCARs
<code>procars-filter.pl</code> and <code>procars_help.pl</code> help the user to create the input file for ProCARs with gene position data
<code>procars_jobs.sh</code> launches ProCARs for every node of the tree
<code>procars-postpro_v4.pl</code> uses the output of ProCARs to create a <code>.xlsx</code> file summarizing the input and the output results in human-friendly form

synteny

The two <code>.R</code> scripts are the synteny program and its configuration file. The <code>synteny_GUI.R</code> script produced the <code>.pdf</code> files in the <code>Outer_membrane</code> directory.

tqmd

The <code>.R</code> and <code>.pl</code> files make up the early version of ToRQuEMaDA used in this study, which at that time was split into several independent scripts. Briefly, <code>tabulate_kmer.pl</code> converts the <code>compseq</code> output into an easier-to-handle format for the clustering script, <code>tqmd_v1.R</code>. <code>stat_names.pl</code> prepares the result file with the quality score of each proteome for use by the <code>best_choice_v2.pl</code> script, which selects the best representative for each group produced by the clustering script.

Trees

The directories in <code>Trees</code> follow the same structure as in <code>Alignments</code>. The <code>DCW_17_SG.pdf</code> and <code>LBCA_OM_16_SG.pdf</code> files are the concatenation of the different single-gene trees for the <em>dcw</em> gene cluster and for the OM-related genes, respectively.
The raw <code>.tre</code> files are available in the corresponding directories (<code>DCW_Single_Genes</code> and <code>OM_Single_Genes</code>). For the <em>dcw</em> genes, trees were computed both under the PROTGAMMALGF and C60 models, whereas the OM gene trees were only inferred using PROTGAMMALGF.

The other files are also in <code>.tre</code> (Newick) format and correspond to supermatrices:

<code>misgen_14-CATG-500-45_root.tre</code> is the preliminary tree using 101 species
<code>misgen_14-fltrd-CATGTRG-1000-1-5000-all_root.tre</code> is the tree using 85 species and 6 MCMC chains
<code>misgen_14-fltrd-CATGTRG-1000-1-5000-CF_root.tre</code> is the tree using 85 species and the 2 best MCMC chains
<code>scafos_supermatrix_CATG-AB-1000-025.tre</code> is the tree built from the <em>dcw</em> genes only
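
The trees listed above are distributed in Newick format, and the Scripts section mentions a Perl script that converts such files to NEXUS. For readers who only need to see what that conversion amounts to, here is a minimal Python sketch. It is not the authors' <code>setup-bayestraits.pl</code>: the function name <code>newick_to_nexus</code>, the tree label <code>tree1</code> and the toy tree are all assumptions made for illustration. Downstream tools may also expect extra NEXUS elements (for example a TRANSLATE block) that this sketch does not produce.

```python
def newick_to_nexus(newick: str, tree_name: str = "tree1") -> str:
    """Wrap a single Newick string in a minimal NEXUS TREES block.

    Hypothetical helper for illustration only; it does not reproduce
    the behaviour of the archive's setup-bayestraits.pl script.
    """
    newick = newick.strip()
    # A Newick tree description is terminated by a semicolon.
    if not newick.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    lines = [
        "#NEXUS",
        "BEGIN TREES;",
        f"    TREE {tree_name} = {newick}",
        "END;",
        "",  # trailing newline at end of file
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy three-taxon tree, not taken from the archive.
    example = "((A:0.1,B:0.2):0.05,C:0.3);"
    print(newick_to_nexus(example))
```

To convert a real file, one would read the contents of a <code>.tre</code> file into a string, pass it to the function, and write the result to a <code>.nex</code> file. The sketch assumes one tree per input file.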
Provider:
figshare
Created:
2021-11-30



