RNA polymerase evolution data files and code (1/2)

Mendeley Data. Updated 2024-04-13; indexed 2024-06-28.

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.n8pk0p30n


Resource description:

# RNA polymerase evolution

The current dataset contains data files associated with the above study. You can follow the Readme for details on the code files. The FASTQ files are submitted as a separate submission: DOI: 10.5061/dryad.zw3r228c4.

## Description of the data and file structure

Information about the FASTQ data and the samples from which they were extracted can be found in the tsv file Sample checklist\_1674571220193.tsv. Details on the mapping between files and experimental samples can be found in the tsv file fastq2\_template\_1674570722019.tsv.

Here is a description of the other files. **Place all these files in a subfolder Fitness\\ to be able to run the code:**

1. \_Preenrichment.csv: There are multiple files whose names end in "\_Preenrichment.csv". Each "x\_Preenrichment.csv" file contains the raw read count data for one evolution experiment. The name of the selection is given before the "\_". For example, GlucoseA\_Preenrichment contains the count data for each variant under selection in glucose. The preenrichment file also contains the nucleotide change and amino acid change in each variant.
2. \_slopes.csv: Contains the fitness/enrichment value for each variant in the database. The methods for slope calculation are explained in the Materials and Methods and in the code in the GitHub repository.
3. table\_with\_slopes.csv: Combines the fitness and enrichment values from multiple experiments.
4. rpoB\_structure.txt: Contains the PDB coordinates for the RNA polymerase beta subunit. They are used in the code to map the correlation between fitness and distance from ligands in the RNA polymerase.
5. Grantham.tsv: Contains the Grantham scores for amino acid changes. It is used in the code to correlate the Grantham score with fitness.
6. combined\_with\_cluster.csv: A dataset with the reads for all the experiments combined and clustered for analysis. The clustering was done using Code 2, described below.
7. dummy\_data.csv: A smaller dummy dataset for running the code in case the big file is too computationally expensive.

## Code/Software

The detailed code can be found at https://github.com/Alaksh/RNA-Polymerase-Evolution.git and on Zenodo (DOI: 10.5281/zenodo.8144064). All code was written in Python.

#### RNA-Polymerase-Evolution

Code for submission of RNA polymerase evolution data

The supplied code covers all the preprocessing, calculations, and code used to generate graphs for the submitted manuscript. The code can be downloaded from GitHub and run locally. To run it, you will need Anaconda Python with the following dependencies installed: Pandas, Biopython, Seaborn, Scipy, and Numpy. Additionally, all sequence analysis was done using the USEARCH algorithm, which needs to be installed as well. Please check the specific installation instructions for each of these tools.

Here is a brief description of each file. For several codes, the preenrichment tables could not be uploaded, so I have uploaded alternate processed files to run the code if needed. The code that processes the original files has been marked as comments and can be run if needed.

Preenrichment\_calculation.ipynb: Preliminary code to process the sequencing data and generate counts tables.

Codes 1 through 3 cannot be run without downloading files from the sequencing repository: DOI: 10.5061/dryad.zw3r228c4.

Code 1 Preprocessing: Processes the raw reads: merges the Fasta files and maps them to the RNA polymerase target sequence.

Code 2 Clustering of Data: Clusters the reads. We wrote a custom algorithm to cluster the sequences, which takes the sequence read count into consideration. Some regions within the target were sequenced but not mutated. We used these regions to estimate an error frequency. We set a threshold above which we considered a variant to be real and not an artifact of sequencing errors. We then used the error frequency to identify possible sequences that were part of the cluster.

Code 3 Fitness estimates: For each condition, the fitness was estimated using the algorithm described in the manuscript.

Code 4 Glucose fitness effects (Alternate file: Dummy\_data.csv): The code covers all figures for Figure 5, where we estimate the growth-associated fitness.

Code 5 Analysis KEIO deltolC (Alternate file: keio\_deltoC\_slopes.csv): The code covers all figures for Figure 5, where we estimate the CBR703-associated enrichment.

Code 6 and 7 epistasis (Alternate files: Dummy\_data.csv and keio\_deltoC\_slopes.csv): The code covers all figures for Figure 5, with growth-associated and CBR703-associated epistasis.

Code 8 (Alternate files: Dummy\_data.csv and keio\_deltoC\_slopes.csv): Code 8 covers Figure 2, where fitness is compared between all conditions.

Code 9 and Code 10 fitness and epistasis in delta relA spoT strains (Alternate files: relA\_spoT\_slopes.csv, Dummy\_data.csv and keio\_deltoC\_slopes.csv): Code 9 covers Figure 4, where we describe stringent mutations and epistasis for stringent enrichment.

Code 11 (delrelAspoT comparison and analysis): Covers the analysis in which we describe stringent mutations and epistasis for stringent enrichment.
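
The Code 2 description (estimate an error frequency from unmutated regions, then use it with read counts to decide which sequences are real variants and which belong to an existing cluster) can be illustrated with a minimal sketch. This is not the repository's algorithm. The function names, the one-mismatch neighbourhood, the `margin` multiplier, and the toy data are all assumptions made for illustration, and all sequences are assumed to have equal length.

```python
def estimate_error_rate(reference, reads):
    """Per-base mismatch frequency in a region that was sequenced but never mutated."""
    mismatches = 0
    total = 0
    for read in reads:
        for ref_base, read_base in zip(reference, read):
            total += 1
            if ref_base != read_base:
                mismatches += 1
    return mismatches / total if total else 0.0


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_count(counts, error_rate, margin=3.0):
    """Greedy, count-aware clustering (hypothetical rule, for illustration only).

    Sequences are visited from most to least abundant. A sequence is absorbed
    into an existing centroid if it is one mismatch away and its read count is
    no larger than margin * error_rate * (the centroid's own count), i.e. it is
    plausibly a sequencing error of that centroid. Otherwise it is treated as
    a real variant and becomes a new centroid.
    """
    clusters = {}     # centroid -> total reads assigned to the cluster
    seed_counts = {}  # centroid -> the centroid's own read count
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for seq, n in ordered:
        parent = None
        for centroid, seed in seed_counts.items():
            if hamming(seq, centroid) == 1 and n <= margin * error_rate * seed:
                parent = centroid
                break
        if parent is None:
            clusters[seq] = n
            seed_counts[seq] = n
        else:
            clusters[parent] += n
    return clusters


# Toy data: 1 mismatch in 1000 sequenced bases of an unmutated region.
unmutated_reads = ["ACGTACGTAC"] * 99 + ["ACGTACGTAA"]
rate = estimate_error_rate("ACGTACGTAC", unmutated_reads)

# Toy variant counts: two abundant variants and two rare one-mismatch neighbours.
toy_counts = {"AAAA": 10000, "AAAT": 20, "AACA": 500, "AACT": 1}
toy_clusters = cluster_by_count(toy_counts, rate)
```

In this toy example the two rare sequences fall below the error-based threshold of their abundant neighbours and are merged into them, while the 500-read variant is kept as its own cluster. The threshold rule is only one plausible choice; the manuscript and the repository remain the reference for the procedure actually used.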

Created:

2023-11-10
