
Datasets (raw and true reads) supporting the study "Benchmarking of computational error-correction methods for next-generation sequencing data"

Source: DataCite Commons. Updated: 2020-08-26. Indexed: 2024-07-28.
Download link:
https://figshare.com/articles/Datasets_raw_and_true_reads_supporting_the_study_Benchmarking_of_computational_error-correction_methods_for_next-generation_sequencing_data_/11776413
Resource description:
We used both simulated and experimental datasets derived from human genomic DNA, human T cell receptor repertoires, and intra-host viral populations. Below we summarize the datasets shared here: D1, D2, D3, D4, and D5. More details can be found in our paper.

D1 dataset: D1 was produced by computational simulation. We generated data mimicking human WGS data using a customized version of the tool WgSim. Read coverage varied between 1 and 32. The WgSim fork is available at https://github.com/mandricigor/wgsim. We used RepeatMasker (version 4.0.9) to annotate the genome (more precisely, chromosome 21 of the human genome) with categories. We also introduced a category "normal", which consists of sequences not in any of the categories listed in the Methods section of our paper.

D2 dataset: Raw reads corresponding to 8 samples (SRR1543964, SRR1543965, SRR1543966, SRR1543967, SRR1543968, SRR1543969, SRR1543970, and SRR1543971) were downloaded from https://www.ncbi.nlm.nih.gov/. The error-free (true) reads for the D2 dataset were generated using a UMI-based high-fidelity sequencing protocol, also known as Safe-SeqS.

D3 dataset: We generated simulated data mimicking TCR-Seq data using the T cell receptor alpha chain (TCRA). Samples have a read length of 100 bp, and read coverage varied between 1 and 32.

D4 dataset: D4 corresponds to HIV population sequencing of an infected patient. The error-free (true) reads for the D4 dataset were generated using a UMI-based high-fidelity sequencing protocol.

D5 dataset: We prepared the viral dataset D5 using real sequencing data from NCBI with the accession number SRR961514. Each read was assigned to the reference with which it has the minimum number of mismatches. The original error rate in the dataset was 1.44%. We modified these reads in two ways. First, we replaced a portion of the errors with the corresponding reference nucleotides to obtain datasets with different error levels (1.44%, 0.33%, 0.1%, 0.033%, 0.01%, 0.0033%, 0.001%, 0.00033%, 0.0001%). Second, we created datasets containing mixtures of two haplotypes with the original 1.44% error rate but with different levels of diversity between the haplotypes (Hamming distance = 5.94%, 0.29%, 0.02%). We applied a haplotype-based error correction protocol to eliminate sequencing errors from the D5 dataset.

For more information, please visit our main repository: https://github.com/Mangul-Lab-USC/benchmarking.error.correction.
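The two D5 preparation steps described above (assigning each read to its closest reference, then reverting part of the errors to the reference base) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the toy haplotypes, and the simplifying assumption that each read is already aligned and has the same length as its reference segment are all hypothetical. The parameter `keep_fraction` is likewise an assumption: it stands for the ratio of the target error rate to the original one (for example 0.33 / 1.44).

```python
import random


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_to_reference(read, references):
    """Return the name of the reference with the fewest mismatches to the read.

    `references` maps a (hypothetical) haplotype name to its sequence.
    Ties go to the first reference in dictionary order.
    """
    return min(references, key=lambda name: hamming(read, references[name]))


def correct_fraction(read, reference, keep_fraction, rng):
    """Revert mismatches to the reference base.

    Each mismatch is kept with probability `keep_fraction` and otherwise
    replaced by the reference nucleotide; matching bases are untouched.
    """
    return "".join(
        base if base == ref or rng.random() < keep_fraction else ref
        for base, ref in zip(read, reference)
    )


if __name__ == "__main__":
    # Toy haplotypes, purely illustrative.
    refs = {"hapA": "ACGTACGT", "hapB": "ACGTTTTT"}
    read = "ACGTACGA"
    best = assign_to_reference(read, refs)
    # Lower the error level from 1.44% toward 0.33% (assumed ratio).
    reduced = correct_fraction(read, refs[best], 0.33 / 1.44, random.Random(0))
    print(best, reduced)
```

Real reads would need an alignment step before mismatches can be counted position by position; the sketch skips that to keep the idea visible.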
Provider:
figshare
Created:
2020-02-01