Datasets (raw and true reads) supporting the study "Benchmarking of computational error-correction methods for next-generation sequencing data"

Name: Datasets (raw and true reads) supporting the study "Benchmarking of computational error-correction methods for next-generation sequencing data"
Creator: figshare
Published: 2020-08-26 01:05:45
License: No description available

DataCite Commons; updated 2020-08-26; indexed 2024-07-28

Download link:

https://figshare.com/articles/Datasets_raw_and_true_reads_supporting_the_study_Benchmarking_of_computational_error-correction_methods_for_next-generation_sequencing_data_/11776413

Description:

We used both simulated and experimental datasets derived from human genomic DNA, human T cell receptor repertoires, and intra-host viral populations. The datasets shared here (D1, D2, D3, D4, and D5) are summarized below. More details can be found in our paper.

D1 dataset: We generated simulated data mimicking human WGS data using a customized version of the tool WgSim. The WgSim fork is available at https://github.com/mandricigor/wgsim. Read coverage varied between 1 and 32. We used RepeatMasker (version 4.0.9) to annotate the genome (more precisely, chromosome 21 of the human genome) with categories. We also introduced a category “normal”, which consists of sequences not in any of the categories listed in the Method section of our paper.

D2 dataset: Raw reads corresponding to 8 samples (SRR1543964, SRR1543965, SRR1543966, SRR1543967, SRR1543968, SRR1543969, SRR1543970, and SRR1543971) were downloaded from https://www.ncbi.nlm.nih.gov/. The error-free (true) reads for the D2 dataset were generated using a UMI-based high-fidelity sequencing protocol, also known as Safe-SeqS.

D3 dataset: We generated simulated data mimicking TCR-Seq data using the T cell receptor alpha chain (TCRA). Samples have read lengths of 100 bp, and read coverage varied between 1 and 32.

D4 dataset: D4 corresponds to HIV population sequencing of an infected patient. The error-free (true) reads for the D4 dataset were generated using a UMI-based high-fidelity sequencing protocol.

D5 dataset: We prepared the viral dataset D5 using real sequencing data from NCBI with the accession number SRR961514. Each read was assigned to the reference with which it has the minimum number of mismatches. The original error rate in the dataset was 1.44%. We modified these reads in two ways. First, we corrected a portion of the errors by replacing them with the corresponding reference nucleotides, obtaining datasets with different error levels (1.44%, 0.33%, 0.1%, 0.033%, 0.01%, 0.0033%, 0.001%, 0.00033%, 0.0001%). Second, we created datasets with mixtures of two haplotypes at the original 1.44% error rate but with different levels of diversity between haplotypes (Hamming distance = 5.94%, 0.29%, 0.02%). We applied a haplotype-based error correction protocol to eliminate sequencing errors from the D5 dataset.

For more information, please visit our main repository: https://github.com/Mangul-Lab-USC/benchmarking.error.correction
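
The two D5 preparation steps described above (assigning each read to its closest reference, then correcting a portion of its errors) can be illustrated with a minimal Python sketch. This is not the code used in the study. The function names, the toy sequences, and the choice to keep each error independently at random are assumptions made for illustration, and the sketch assumes reads are already aligned to references of equal length.

```python
import random


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_to_reference(read, references):
    """Return the name of the reference with the fewest mismatches to the read.

    Assumption: the read and every reference cover the same region and have
    equal length. Ties go to the first reference in dict order.
    """
    return min(references, key=lambda name: hamming(read, references[name]))


def reduce_errors(read, reference, keep_fraction, rng):
    """Revert mismatches to the reference base.

    Each mismatch is kept with probability keep_fraction (an assumed scheme;
    the source only states that a portion of the errors was corrected).
    """
    out = []
    for base, ref_base in zip(read, reference):
        if base != ref_base and rng.random() >= keep_fraction:
            out.append(ref_base)  # correct this error
        else:
            out.append(base)      # matching base, or an error that is kept
    return "".join(out)


# Toy data (hypothetical): two haplotype references and one read.
references = {"hapA": "ACGTACGTAC", "hapB": "ACGTTCGTTC"}
read = "ACGTACGAAC"  # one mismatch vs hapA, three vs hapB

best = assign_to_reference(read, references)

# Moving from the original 1.44% error rate to a 0.33% target means keeping
# roughly 0.33 / 1.44 of the errors.
keep_fraction = 0.33 / 1.44
cleaned = reduce_errors(read, references[best], keep_fraction, random.Random(0))
```

Setting `keep_fraction` to 1.0 leaves the read unchanged, while 0.0 reverts every mismatch to the reference, so the listed error levels correspond to intermediate values of this ratio.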

Provider:

figshare

Created:

2020-02-01
