Experimental data for "An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore Sequenced Reads"

Indexed by NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/10943281

Description:

The experimental dataset used in "An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore Sequenced Reads." A set of 91,766 oligos, each 150 nt long, was synthesised by GenScript (oligos.fasta). Each oligo consists of a pseudo-random 110-nt payload flanked by a 20-nt primer at each end. The strands are split into three roughly equal groups (two groups of 30,589 and one group of 30,588). Each group has a dedicated primer pair for targeted PCR amplification (the primer pairs used for amplification are provided in primers_synthesis.fasta). The pseudo-random payload was designed to avoid primer-payload collisions. For each of the three files (one per strand group), a sample from the synthesised pool was PCR amplified using the corresponding primer pair and sequenced on an Oxford Nanopore Technologies MinION device following the standard library preparation protocol for amplicon DNA. The raw reads were basecalled using Guppy in either fast ("acc-false") or high-accuracy ("acc-true") mode. The basecaller sorted the reads into two groups: "passQ-true" for the reads that passed the quality-score threshold of 8 and "passQ-false" for those that did not. For each group of reads, a BLAST-based fuzzy search for the primer sequences was performed and, based on the resulting alignments, the segments that contained the correct primer pair with the primers 150 ± 15 nt apart were extracted (separately for forward and reverse-complemented reads). Each segment was then assigned to the closest synthesised strand based on Levenshtein distance. The resulting clusters were used to estimate the parameters of the end-to-end DNA storage channel model and to test the proposed error-correction scheme. The archive clustered_read_segments.tar.gz contains 12 sub-archives, one for each combination of file (0, 1, 2), basecalling accuracy ("acc-true" or "acc-false"), and Q-score filter outcome ("passQ-true" or "passQ-false").
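
The assignment of a read segment to its closest synthesised strand by Levenshtein distance, as described above, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `levenshtein` and `assign_to_closest` are hypothetical, the search is a plain brute-force scan over all reference strands, and ties are broken in favour of the first reference found, which is an assumption.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete from a
                    current[j - 1] + 1,                   # insert into a
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def assign_to_closest(segment: str, references: Sequence[str]) -> Tuple[int, int]:
    """Return (index, distance) of the reference strand closest to segment.

    Assumption: ties go to the reference with the lowest index.
    """
    if not references:
        raise ValueError("references must not be empty")
    best_index = 0
    best_distance = levenshtein(segment, references[0])
    for index in range(1, len(references)):
        distance = levenshtein(segment, references[index])
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance
```

Calling `assign_to_closest(segment, strands)` for every extracted segment and grouping the segments by the returned index yields one cluster per synthesised strand. A pure-Python scan like this is slow for groups of about 30,000 strands, so a practical run would likely need an optimised edit-distance routine or a pre-filtering step.
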
Within each sub-archive there are two folders, one for forward read segments and one for backward read segments. Each folder contains two files. The "TX__" file (e.g., "TX__file=0_accBaCa=true_passQ=true_filter=true_forward_.txt") holds the reference synthesised (or "transmitted") sequences that correspond to the file in question. The "RX__" file (e.g., "RX__file=0_accBaCa=true_passQ=true_filter=true_forward_.txt") holds the sequenced (or "received") segment clusters. The clusters in the "RX__" file appear in the same order as the synthesised sequences in the "TX__" file, and a line "===============================" is used as a separator between clusters.
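
As a rough illustration of how a "TX__"/"RX__" pair could be read, the sketch below pairs each transmitted sequence with its received cluster. The exact file layout is not specified here, so several things are assumptions: one sequence per line in both files, a separator line made up only of "=" characters that closes each cluster (so an empty cluster is represented by two consecutive separators), and the hypothetical helper names `parse_rx_clusters` and `pair_tx_rx`.

```python
from typing import Iterable, List, Tuple


def is_separator(line: str) -> bool:
    """True for a non-empty line consisting only of '=' characters."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"="}


def parse_rx_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Split the lines of an RX__ file into clusters of received segments.

    Assumption: every cluster, including an empty one, ends with a separator.
    """
    clusters: List[List[str]] = []
    current: List[str] = []
    for raw_line in lines:
        line = raw_line.strip()
        if is_separator(line):
            clusters.append(current)
            current = []
        elif line:
            current.append(line)
    if current:  # tolerate a final cluster with no closing separator
        clusters.append(current)
    return clusters


def pair_tx_rx(
    tx_lines: Iterable[str], rx_lines: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Pair each transmitted sequence with the cluster at the same position."""
    transmitted = [line.strip() for line in tx_lines if line.strip()]
    clusters = parse_rx_clusters(rx_lines)
    if len(transmitted) != len(clusters):
        raise ValueError(
            f"{len(transmitted)} TX sequences but {len(clusters)} RX clusters; "
            "the assumed file layout may not match the actual files"
        )
    return list(zip(transmitted, clusters))
```

Both functions accept any iterable of lines, so open file objects can be passed directly, for example `pair_tx_rx(open(tx_path), open(rx_path))` with `tx_path` and `rx_path` pointing at a matching pair of files. The length check is there to fail loudly if the layout assumptions do not hold for the real files.
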

Created:

2024-05-03
