Cell-free DNA coverage and fragmentomic profiles produced under various sequencing protocols
Source: DataCite Commons. Updated: 2024-08-27. Indexed: 2024-09-03.
Download link:
https://figshare.com/articles/dataset/Cell-free_DNA_coverage_and_fragmentomic_profiles_produced_under_various_sequencing_protocols/24459304
Resource description:
This repository contains 4 data sets designed for assessing the validity of domain adaptation algorithms for bias correction in cell-free DNA data (coverage or fragmentomic profiles). Samples processed under the same or similar preanalytical settings are grouped into so-called domains.

Each data subset has its own peculiarities:

The NIPT set contains 1126 profiles from 563 biological samples, each processed twice with different library preparation or sequencing methods. The 1126 profiles are split into 12 domains, and each profile in any domain is paired with a profile from another domain. For example, every profile in domain D3a was sequenced on the Illumina HiSeq 2000 platform and originates from a sample that was also sequenced on the Illumina NovaSeq 6000 platform, resulting in a second profile in domain D3b. Domain names containing the same digit differ by one or two preanalytical variables (e.g., the sequencing platform in the case of D3x). The domains present in the NIPT data set are: D1a, D1b, D2a, D2b, D3a, D3b, D4a, D4b, D5a, D5b, D6a, D6b.

The OV set contains ovarian carcinoma cases and controls from 2 domains. The 2 domains correspond to 2 different bioinformatics teams, so both library preparation and sequencing protocols differ. 64 of the samples were processed by both teams, resulting in coverage profiles in both domains. The domains present in the OV data set are: D9, D10.

The HEMA set contains 238 haematological cancer cases (Hodgkin lymphoma, diffuse large B cell lymphoma and multiple myeloma) and 242 controls in one domain, and only controls (257) in the second domain. The domains present in the HEMA data set are: D7, D8.

The FRAG set contains paired-end sequencing data. While some of the samples were originally sequenced for methylation analysis, they have been used for fragmentomic analysis in the present context. The data set contains 74 female controls and 51 breast cancer cases prepared with the NEBNext Enzymatic Methyl-seq kit, as well as 57 female controls prepared with the KAPA HyperPrep kit.

The metadata.csv file contains the details of each coverage profile, including data set, domain and category (e.g., healthy, Hodgkin lymphoma). When a profile is paired with a profile from another domain, meaning that both originate from the same biological sample, the identifier of the corresponding profile is reported in the "Paired-With" column.

For the NIPT, HEMA and OV data sets, each coverage profile was produced by counting, for each 10 kb bin, the reads whose alignment starts in that bin. The profiles were then smoothed with a running average over a window of 100 bins. The standard deviation within the running window was also recorded for each bin position. Each zip file contains all the coverage profiles from one domain, and each coverage profile is stored as a compressed tsv file with the following columns:

"BINDEX" is the index of the bin, essentially the row number.
"CHR" is the chromosome on which the bin is located.
"MEAN" is the average normalized coverage within a window of 100 contiguous bins centered on the current bin. Each bin has a size of 10 kb.
"SD" is the standard deviation computed over the same window.

The data set also contains 4 supplementary files:

gc-content-1000kb.csv: GC content for each bin of the reference genome HG38. The number of lines in this file corresponds to the number of bins in the coverage profile files after removing allosomal and mitochondrial bins. A value of -1 marks blacklisted regions.
mappability-1000kb.csv: Mappability of each bin.
blacklisted-10kb-bins.csv: List of blacklisted 10 kb bins.
D11-D12-batches.json: Mapping from the sample names of the FRAG data set to the library preparation period (i.e., month). Sequencing batches are strictly included in library preparation periods. This information is used for leave-one-batch-out cross-validation, so that batch effects do not produce overoptimistic performance estimates.

GC content and mappability information was taken from:
https://github.com/broadinstitute/ichorCNA/tree/master/inst/extdata
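
The pairing recorded in metadata.csv can be recovered with a few lines of Python. The following is a minimal sketch, not the authors' code. Only the "Paired-With" column name comes from the description above; the identifier column name ("ID" in the usage comment) is a hypothetical placeholder to be checked against the actual file header, and unpaired profiles are assumed to have an empty "Paired-With" cell.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def find_pairs(rows, id_column, paired_column="Paired-With"):
    """Return each pair of profiles once, as a sorted tuple of identifiers.

    id_column is an assumption: pass the real name of the identifier column.
    Rows with an empty paired_column are treated as unpaired (assumption).
    """
    pairs = set()
    for row in rows:
        partner = (row.get(paired_column) or "").strip()
        if partner:
            # Sorting makes (a, b) and (b, a) collapse into a single entry.
            pairs.add(tuple(sorted((row[id_column], partner))))
    return sorted(pairs)


# Hypothetical usage, assuming the identifier column is called "ID":
# pairs = find_pairs(load_rows("metadata.csv"), id_column="ID")
```

Restricted to the NIPT rows, the description implies that such a lookup should yield 563 pairs, since 1126 profiles come from 563 samples that were each processed twice.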
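
The way the MEAN and SD columns are described (read-start counts per 10 kb bin, then a centered window of 100 bins) can be illustrated as follows. This is a sketch under stated assumptions rather than the pipeline that produced the files: the description does not say how coverage was normalized, how windows are handled at chromosome ends, or whether the SD is a population or sample estimate. Here normalization is omitted, windows are simply truncated at the edges, and the population SD is used.

```python
import math

BIN_SIZE = 10_000  # 10 kb bins, as stated in the description
WINDOW = 100       # running-window size in bins, as stated in the description


def count_read_starts(start_positions, n_bins, bin_size=BIN_SIZE):
    """Count reads per bin from 0-based alignment start positions
    on a single chromosome. Positions outside the bins are ignored."""
    counts = [0] * n_bins
    for pos in start_positions:
        index = pos // bin_size
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def running_mean_sd(values, window=WINDOW):
    """Centered running mean and population SD.

    Assumption: windows are truncated at both ends of the sequence,
    so edge bins are averaged over fewer than `window` values.
    """
    n = len(values)
    means, sds = [], []
    for i in range(n):
        lo = max(0, i - window // 2)
        hi = min(n, i - window // 2 + window)
        chunk = values[lo:hi]
        mean = sum(chunk) / len(chunk)
        variance = sum((v - mean) ** 2 for v in chunk) / len(chunk)
        means.append(mean)
        sds.append(math.sqrt(variance))
    return means, sds
```

The quadratic-time loop keeps the logic easy to follow. For genome-wide profiles a cumulative-sum implementation would be faster, but the result would be the same under the assumptions above.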
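
The leave-one-batch-out scheme mentioned for D11-D12-batches.json could be set up along the following lines. This is a hedged sketch: it assumes the JSON file is a flat object mapping each sample name to a batch label (the library preparation month), which should be verified against the actual file. The function names are illustrative and not part of the data set.

```python
import json
from collections import defaultdict


def load_batches(path):
    """Load the sample-to-batch mapping (assumed to be a flat JSON object)."""
    with open(path) as handle:
        return json.load(handle)


def leave_one_batch_out(sample_to_batch):
    """Yield (held_out_batch, train_samples, test_samples), one fold per batch.

    All samples from the held-out batch go to the test set, so no batch
    contributes to both training and evaluation within a fold.
    """
    by_batch = defaultdict(list)
    for sample, batch in sample_to_batch.items():
        by_batch[batch].append(sample)
    for held_out in sorted(by_batch):
        test = sorted(by_batch[held_out])
        train = sorted(
            sample
            for batch, members in by_batch.items()
            if batch != held_out
            for sample in members
        )
        yield held_out, train, test


# Hypothetical usage:
# for batch, train, test in leave_one_batch_out(load_batches("D11-D12-batches.json")):
#     ...  # fit on train, evaluate on test
```

Because sequencing batches are strictly included in library preparation periods, holding out a whole period also holds out every sequencing batch it contains.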
Provider:
figshare
Created:
2023-10-31



