A Distributed Whole Genome Sequencing Benchmark Study

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://www.ncbi.nlm.nih.gov/sra/SRP278908

Description:

Population sequencing at a national or international scale often requires collaboration across a distributed network of sequencing centres, pooling the capacity of existing investments and infrastructure for the timely processing of thousands of samples. In such large efforts, participating scientists must be confident that the accuracy of the sequence data is not affected by which centre generates it. A study to assess this was conducted across three independently established sequencing centres, located in Montreal, Toronto, and Vancouver, which together constitute Canada's Genomics Enterprise (CGEn; www.cgen.ca). Whole genome sequencing was performed at each centre, using existing protocols and the current industry-standard short-read technology, on three genomic DNA replicates isolated from each of three well-characterized cell lines. The standard secondary analysis pipeline employed by each site was then applied to the sequence data from every site, giving three conditions for each of four variables (cell line, replicate, sequencing centre, and analysis pipeline), for a total of 81 datasets (3 × 3 × 3 × 3). Each dataset was assessed against multiple quality metrics, including sequence-level statistics and concordance with benchmark variant truth sets, to verify consistent quality across all three conditions for each variable. Three-way concordance (overlap) analysis of variants across conditions for each variable was also performed. Our results showed that while differences between the sequence output of different geographical sites were detectable, the variant concordance between datasets differing only by sequencing centre was similar to the concordance between datasets differing only by replicate, when the same analysis pipeline was used. We also showed that the main statistically significant differences between datasets result from the analysis pipeline used, which can be unified across sites and updated as new approaches become available. We conclude that multi-site genome sequencing projects can rely on the quality and reproducibility of data generated across a distributed network.
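
The study design described above can be illustrated with a minimal Python sketch. It enumerates the 81 combinations of the four variables and computes a simple three-way overlap between variant sets. The condition labels, the function names, and the definition of concordance as shared variants divided by the union of all variants are assumptions made for illustration; the description does not name the cell lines or pipelines, and does not state which concordance formula the study used.

```python
from itertools import product

# Hypothetical labels: the description names the four variables and the
# three cities, but not the cell lines, replicates, or pipelines.
CELL_LINES = ["cell_line_1", "cell_line_2", "cell_line_3"]
REPLICATES = ["rep_1", "rep_2", "rep_3"]
CENTRES = ["Montreal", "Toronto", "Vancouver"]
PIPELINES = ["pipeline_1", "pipeline_2", "pipeline_3"]


def enumerate_datasets():
    """Return every (cell line, replicate, centre, pipeline) combination."""
    return list(product(CELL_LINES, REPLICATES, CENTRES, PIPELINES))


def three_way_concordance(a, b, c):
    """Fraction of the union of three variant sets that all three share.

    Assumed definition (intersection over union); the study may use a
    different measure.
    """
    union = a | b | c
    if not union:
        return 0.0  # no variants at all: defined as 0.0 here by convention
    return len(a & b & c) / len(union)


if __name__ == "__main__":
    print(len(enumerate_datasets()))  # 3 * 3 * 3 * 3 combinations

    # Toy variants keyed by (chromosome, position, ref allele, alt allele).
    v1 = ("chr1", 1000, "A", "G")
    v2 = ("chr1", 2000, "C", "T")
    v3 = ("chr2", 3000, "G", "A")
    v4 = ("chr2", 4000, "T", "C")

    # Three call sets that differ only by sequencing centre (made-up data).
    montreal = {v1, v2, v3}
    toronto = {v1, v2, v4}
    vancouver = {v1, v2}
    print(three_way_concordance(montreal, toronto, vancouver))
```

Comparing this value for datasets that differ only by sequencing centre against the value for datasets that differ only by replicate mirrors the comparison reported in the description, under the assumed overlap definition.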

Created:

2020-10-05
