Deep sequencing datasets from: RNA-catalyzed evolution of catalytic RNA

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.rxwdbrvgs


Description:

This dataset includes raw and processed sequencing data from evolving RNA populations described in Nikos Papastavrou, David P Horning, Gerald F Joyce. "RNA-Catalyzed Evolution of Catalytic RNA" (submitted). Briefly, directed evolution of a hammerhead ribozyme sequence was carried out over eight rounds of three steps each: 1) templated synthesis of the reverse-complement hammerhead RNA by a polymerase ribozyme; 2) templated synthesis of a new copy of the hammerhead RNA from the reverse-complement by a polymerase ribozyme; 3) selective recovery of hammerhead RNA that cleaved an attached RNA substrate. Cleaved RNA was reverse transcribed, PCR amplified, and archived for sequencing, while a portion was in vitro transcribed with T7 RNA polymerase to initiate the next round of evolution. Two distinct branches of evolution were carried out for 8 rounds, using the '52-2' or '71-89' polymerase ribozymes to replicate RNA, respectively. Sequenced RNA populations were analyzed to determine polymerase ribozyme fidelity and study the evolution of hammerhead sequences replicated by polymerases with low or high RNA copying fidelities. The dataset includes raw sequence files, processed tables of mutations by position along the sequence for each polymerase, processed tables of the sequence frequency distribution from each round in the evolving populations, and spreadsheets containing final processed data used directly in manuscript figures and tables.

Methods

Directed evolution of a hammerhead ribozyme sequence was carried out in three steps: 1) templated synthesis of the reverse-complement hammerhead RNA by a polymerase ribozyme; 2) templated synthesis of a new copy of the hammerhead RNA from the reverse-complement by a polymerase ribozyme; 3) selective recovery of hammerhead RNA that cleaved an attached RNA substrate.
Cleaved RNA was reverse transcribed, PCR amplified, and archived for sequencing, while a portion was in vitro transcribed with T7 RNA polymerase to initiate the next round of evolution. Two distinct branches of evolution were carried out for 8 rounds, using the '52-2' or '71-89' polymerase ribozymes to replicate RNA, respectively. A third 'reselection' branch took six distinct hammerhead ribozyme sequences, and subjected them to a single round of directed evolution either individually or as an equimolar mixture, using the '71-89' polymerase for RNA replication. RNA products from each of three steps of each round of evolution ('HHR-', 'HHR+ pre-cleaved', and 'HHR+ cleaved', respectively) were reverse transcribed, amplified by PCR, and sequenced in an Illumina NextSeq2000 with a 300-cycle paired-end run. Raw paired-end fastq files were trimmed of Illumina adapter sequences and reads lacking the correct primer sequences (allowing no more than 2 mismatches) were removed using cutadapt v4.3. Paired reads were merged using FLASH v1.2.11, then filtered for reads with a quality score ≥30 at every position using FASTX Toolkit v0.0.14. Two different analytical pipelines were performed. In one, data from the first round of evolution in either of the first two branches were used to determine the per-nucleotide fidelity of each polymerase in synthesizing the HHR- and HHR+ pre-cleaved RNA populations. Fastq files for each library were processed using a custom Python script to enumerate the reads for each distinct sequence in each library, and the results were tabulated to include sequence name, nucleotide sequence, and number of reads. A fasta file of all distinct sequences was also generated. These fasta files were aligned to the corresponding wild-type sequence using bowtie2 v2.4.2, with a permissive scoring function to ensure that highly mutated sequences in the 52-2 lineage were included in the alignment. 
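The enumeration step described above (counting reads per distinct sequence and writing a fasta file of the distinct sequences) could look roughly like the following. This is a minimal sketch and not the authors' script, which is included in the dataset: the function names, the `seq` naming prefix, and the assumption of plain four-line FASTQ records are all illustrative.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4-line FASTQ records."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # line 2 of every record holds the bases
            yield line.strip().upper()


def tabulate_distinct(lines: Iterable[str], prefix: str = "seq") -> List[Tuple[str, str, int]]:
    """Return (name, sequence, read count) rows, most abundant first.

    The naming scheme (prefix plus rank) is hypothetical.
    """
    counts = Counter(read_fastq_sequences(lines))
    return [
        (f"{prefix}{rank}", seq, n)
        for rank, (seq, n) in enumerate(counts.most_common())
    ]


def to_fasta(table: List[Tuple[str, str, int]]) -> str:
    """Render the distinct sequences as FASTA text, one record per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in table)
```

In practice the `lines` argument would be an open handle on a merged, quality-filtered fastq file, and the FASTA text would be written to disk before alignment.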
The generated sam file was converted to a sorted, indexed bam file using SAMtools v1.9. A gapped alignment was generated using the bamtoaln module of breseq v0.35.5. The corresponding sequence table was updated with the gapped, aligned sequences using a custom Python script. The results were tabulated to include the number of substitutions, insertions, or deletions at each nucleotide position using a custom Python script. These tables were processed using Microsoft Excel to obtain the average mutation frequencies for each of the four nucleotide bases. In a second pipeline, data over all eight rounds of evolution were processed for evolution with either polymerase. Fastq files were processed into a table of each distinct sequence and corresponding number of reads in each round using a custom Python script. For each round of evolution, the peak sequences were identified using a modified version of the ClusterBOSS Python script of Blanco and Chen (E. Janzen, Y. Shen, A. Vázquez-Salazar, Z. Liu, C. Blanco, J. Kenchel, I. A. Chen, Emergent properties as by-products of prebiotic evolution of aminoacylation ribozymes. Nat. Commun. 13, 3631 (2022)). For each peak sequence, additional sequences were clustered together with the peak based on the following criteria: i) a peak sequence with ≥1000 reads; ii) additional sequences having no more than 2 mutations relative to the peak; and iii) a cluster, including the peak, with ≥2000 reads. The algorithm sorted all distinct sequences from most to least frequent, determined if a sequence was more abundant than all nearby sequences within 2 mutations, and if so, defined it as the peak sequence and clustered to that peak all other sequences within 2 mutations that were not already members of a cluster. A second script updated the sequence table with columns indicating membership within a cluster for a given round of evolution. 
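The peak-finding and clustering rule described above can be sketched as follows. This is an illustration of the stated criteria only, not the modified ClusterBOSS script itself: the function names and the tie-breaking rule are assumptions, and "mutations" is taken here to mean Levenshtein (edit) distance. The all-against-all comparison is quadratic, so a real pipeline would need something faster for large populations.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (substitutions, insertions, deletions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def find_clusters(
    counts: Dict[str, int],
    max_dist: int = 2,
    min_peak: int = 1000,
    min_cluster: int = 2000,
) -> Dict[str, List[str]]:
    """Map each peak sequence to its cluster members (peak listed first)."""
    # Most abundant first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    assigned = set()
    clusters: Dict[str, List[str]] = {}
    for idx, seq in enumerate(ordered):
        if counts[seq] < min_peak:
            break  # every remaining sequence is below the peak threshold
        # A peak must not lie within max_dist of any sequence ranked above it.
        if any(levenshtein(seq, other) <= max_dist for other in ordered[:idx]):
            continue
        members = [
            s for s in ordered[idx + 1:]
            if s not in assigned and levenshtein(seq, s) <= max_dist
        ]
        if counts[seq] + sum(counts[m] for m in members) < min_cluster:
            continue  # cluster too small; its sequences stay unassigned
        clusters[seq] = [seq] + members
        assigned.update(clusters[seq])
    return clusters
```

The default thresholds mirror the criteria in the text (peak with at least 1000 reads, members within 2 mutations, cluster total of at least 2000 reads).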
Sequences that included the strictly conserved hammerhead nucleotides 5´‑CUGANGA… GAAA-3´, together with a Watson-Crick base pair at the base of stem I and either an R:Y or Y:Y pair at the base of stem II, were identified as matching the biochemically-defined hammerhead motif (D. E. Ruffner, G. D. Stormo, O. C. Uhlenbeck, Sequence requirements of the hammerhead RNA self-cleavage reaction. Biochemistry 29, 10695–10702 (1990)). A custom Python script updated the sequence table with a new column containing Boolean values for whether each sequence matched the hammerhead motif. Finally, a custom Python script was used to determine the Levenshtein distance between each distinct sequence and four reference sequences. The script outputs a sequence-frequency table containing the sequence name, nucleotide sequence, Levenshtein distance to each of the reference sequences, normalized read frequency, a Boolean value for matching the hammerhead motif, and membership within any cluster. Output sequence-frequency tables were further processed to generate figures and tables as presented in N. Papastavrou, D. P. Horning, G. F. Joyce, RNA-Catalyzed Evolution of Catalytic RNA, submitted (2023). The descriptions below cover only the figures and tables derived directly from the sequencing dataset. Methods for measuring and plotting biochemically derived data are provided in the manuscript.

Tables S2 and S3. Copying and replication fidelity of the 52-2 and 71-89 polymerases. These tables were prepared directly from the spreadsheet generated by the polymerase fidelity pipeline described above, without further processing.

Fig. 2. RNA-catalyzed evolution of the hammerhead ribozyme. A custom Python script was used to plot the mutation-frequency distribution relative to the wild-type sequence as smoothed violin plots using the matplotlib and seaborn libraries.
The frequency of sequences matching the hammerhead motif was determined for each round of each lineage in RStudio using a custom R notebook and plotted using GraphPad Prism. These two plots, along with the plot of hammerhead population RNA cleavage activity, were combined in Adobe Illustrator.

Fig. S1. Selective enrichment of hammerhead variants during RNA-catalyzed evolution. Relative changes in frequency between round 7 and either round 8 or a mock round 8 were calculated for each sequence that was present in round 7 at >0.01% frequency, corresponding to a sequencing depth of at least 50 reads, in RStudio using a custom R notebook. Histograms were generated in RStudio using ggplot2.

Fig. S2. Sequence diversity over the course of RNA-catalyzed evolution. The average mutational distance between sequences and the Shannon entropy of the population were determined from sub-sampled sequence datasets in RStudio using a custom R notebook and plotted using GraphPad Prism. The average distance between sequences in each population of cleaved HHR+ RNAs was determined by randomly sampling 100,000 pairs of sequences from the list of distinct sequences, weighted by the frequency of each sequence in a given round, and averaging the Levenshtein distance between each pair divided by the number of variable nucleotides in the first member of the pair. The normalized Shannon population entropy, which was also determined from a sample of 100,000 sequences, is defined as the sum over all distinct sequences: ∑ Fs • ln(Fs) / ln(1/N), where Fs is the frequency of each distinct sequence in the sample and N is the total number of sequences in the population. Sub-sampling of sequences was carried out to ensure that entropy values were determined at the same sequencing depth (J. Gregori, C. Perales, F. Rodriguez-Frias, J. I. Esteban, J. Quer, E. Domingo, Viral quasispecies complexity measures. Virology 493, 227–237 (2016)).

Fig. 3 and S3.
Emergence of peak sequences and clusters over the course of evolution catalyzed by the 71-89 and 52-2 polymerases. For 3A and S3A, phylogenetic trees rooted to the wild-type sequence were generated using the neighbor-joining algorithm of Saitou and Nei, encompassing all sequences that reached a maximum frequency >0.1% for the 52-2 lineage and >0.5% for the 71-89 lineage during rounds 3–8 of evolution. Trees were determined in RStudio using a custom R notebook with the neighbor-joining function in the ape R library and plotted using the ggtree and treeio libraries. For 3B and S3B, the frequency distributions of peak sequences, phylogenetically neighboring sequences, and clusters were determined using the tidytree library and plotted using ggplot2 in RStudio.

Fig. 4B. Relative fitness values were determined from experimentally determined values for HHR- RNA synthesis, HHR+ RNA synthesis, and RNA cleavage yields, and from population values for the fraction of functional hammerheads produced from each variant and the fraction of progeny derived from each variant after re-selection. These latter values were determined from the sequencing of RNA population libraries from the re-selection of hammerhead variants Seq0, Seq2, Seq3, Seq5, Seq15, and Seq35, alone or as a mixture, with all calculations performed in RStudio using a custom R notebook. Relative fitness values were calculated from these data in Microsoft Excel and plotted in GraphPad Prism. The fraction of functional hammerheads produced from each variant was determined as the fraction of pre-cleaved HHR+ RNA sequences propagated from each individual variant that matched the biochemically determined hammerhead sequence motif. The specific activity of each variant was estimated by multiplying the experimentally determined yield of RNA cleavage by the fraction of functional hammerheads produced from each variant.
The frequencies of cleaved HHR+ RNA sequences in the mixed population were fit by multiple linear regression to the frequencies of sequences in each population that was propagated individually. Regression coefficients for the individually propagated populations were used to assign the fractional contribution of corresponding starting RNAs to progeny RNAs in the mixed population. Fitness values for the new hammerhead variants (Seq2-35) relative to the starting variant (Seq0) were calculated in three different ways. First, by multiplying the experimentally determined values for HHR- synthesis, HHR+ synthesis, and RNA cleavage yields of each of the variants, as plotted in Fig. 4A. Second, by the same method, but replacing RNA cleavage yield with the estimate of specific activity determined above. Third, by the relative enrichment of copy number for each variant, as determined from the frequency of each variant in the starting mixture of all six variants and the frequency of progeny from each variant after selection.

Fig. 5 and Movie S1. Scatterplots of the evolving populations of hammerhead ribozymes. A 2-dimensional map of the evolving populations in sequence space was constructed based on the Levenshtein distance from each distinct sequence in the population to four reference sequences. The reference sequences were the wild type, a distant non-hammerhead from the 52-2 lineage (variable sequence 5′-…GCUGGUUGCUACAGCCG…-3′, lacking any discernible features of the hammerhead motif), and Seq15 and Seq35 from the 71-89 lineage. A 2-dimensional plane was defined by the first two principal components of variation of the distance matrix between these four sequences, and the position of all distinct cleaved HHR+ sequences was projected onto this plane. The density of hammerhead functionality across the 2-dimensional plane was estimated as the local average fraction of all sequences in the cleaved HHR+ RNA populations that match the hammerhead motif.
Processing and plotting were done in RStudio with a custom R notebook and the plotly library. Animation of the scatterplots was prepared in Apple Keynote.

Details of each pipeline are included in this dataset's README file, including detailed command line calls for each script or program. The dataset includes raw Illumina fastq files, processed fidelity and sequence-frequency tables, and Excel spreadsheets containing data generated by the pipeline that were used to plot manuscript figures. All custom scripts in Python and R are also included.
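For readers who want to reproduce the diversity measure reported for Fig. S2 without the R notebook, the normalized Shannon entropy ∑ Fs • ln(Fs) / ln(1/N) can be sketched in Python as below. Two points are assumptions rather than statements from the source: N is taken to be the size of the sub-sample, and sub-sampling is done with replacement, weighted by read count. The function names and the seed parameter are illustrative.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence


def subsample(counts: Dict[str, int], size: int, seed: int = 0) -> List[str]:
    """Draw `size` sequences with replacement, weighted by read count (assumed scheme)."""
    rng = random.Random(seed)
    seqs = list(counts)
    return rng.choices(seqs, weights=[counts[s] for s in seqs], k=size)


def normalized_shannon_entropy(sample: Sequence[str]) -> float:
    """Compute sum(Fs * ln(Fs)) / ln(1/N) with N = len(sample).

    The value is 0 when every read is identical and 1 when every read is distinct.
    """
    n = len(sample)
    if n < 2:
        return 0.0  # ln(1/N) is zero for N = 1, so the ratio is undefined
    freqs = [c / n for c in Counter(sample).values()]
    return sum(f * math.log(f) for f in freqs) / math.log(1 / n)
```

Because the denominator depends on N, entropy values are only comparable between rounds when each round is sub-sampled to the same depth, which is the reason the source gives for sub-sampling.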

Created:

2024-02-26
