Hydroxymethylation profile of cell free DNA is a biomarker for early colorectal cancer

Indexed by NIAID Data Ecosystem on 2026-03-13

Download link:

https://zenodo.org/records/5170265

Resource description:

The files in this data release represent processed data from the FORESEE study conducted by Cambridge Epigenetix Ltd and reported in the preprint manuscript "Hydroxymethylation profile of cell free DNA is a biomarker for early colorectal cancer" (Walker et al. 2021). As described in the manuscript, classifiers were trained and validated on genomic features extracted from sequencing datasets across cases and controls. Several classes of genomic features were constructed for the training and validation data sets, as described below.

CRC_enhancer_znorm_training_matrix_v1.csv
CRC_enhancer_znorm_validation_matrix_v1.csv

Columns contain sample names; rows contain genomic features.

Description of the feature generation process: To calculate 5hmC levels at gene enhancers, we first calculated read counts using Bam readcounts v0.01. RPKM values were calculated over candidate gene enhancers downloaded from GeneCards v4.4. 5hmC enrichment was computed as the log2 ratio between the hydroxymethylome library RPKM and the input library RPKM after the inclusion of pseudocounts. The 5hmC levels of enhancers were feature-scaled (z-score normalization) and quantile-normalized over samples.

CRC_cegxdelfi_znorm_training_matrix_v1.tsv
CRC_cegxdelfi_znorm_validation_matrix_v1.tsv

Columns contain sample names; rows contain genomic features.

Description of the feature generation process: We divided the genome into 100 KB bins and quantified cfDNA fragment sizes per bin. We removed blacklisted regions, genomic gaps (UCSC table) and non-standard chromosomes a priori. We excluded outlier bins in fragment size, retaining only fragments between 100 and 220 nt in length. Finally, we split the genome into 100 KB bins (26170 non-overlapping genomic regions in total) and calculated the following characteristics of the fragment size distribution per genomic bin: number of short fragments (100-150 nt), number of long fragments (151-220 nt), ratio between short and long fragments, and total number of fragments.
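The per-bin fragment statistics described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input format (tuples of chromosome, start position and fragment length), the function name `fragment_bin_stats`, and the choice to assign each fragment to a bin by its start position are all assumptions made for illustration. Filtering of blacklisted regions, gaps and non-standard chromosomes is omitted.

```python
from collections import defaultdict

BIN_SIZE = 100_000          # 100 KB bins, as in the description
MIN_LEN, SHORT_MAX, MAX_LEN = 100, 150, 220  # short: 100-150 nt, long: 151-220 nt


def fragment_bin_stats(fragments, bin_size=BIN_SIZE):
    """Count short and long fragments per genomic bin.

    `fragments` is an iterable of (chrom, start, length) tuples.
    This input format is hypothetical, chosen only for the example.
    """
    counts = defaultdict(lambda: {"short": 0, "long": 0})
    for chrom, start, length in fragments:
        # Keep only fragments between 100 and 220 nt.
        if not MIN_LEN <= length <= MAX_LEN:
            continue
        # Assumption: a fragment belongs to the bin containing its start.
        key = (chrom, start // bin_size)
        if length <= SHORT_MAX:
            counts[key]["short"] += 1
        else:
            counts[key]["long"] += 1

    stats = {}
    for key, c in counts.items():
        # The ratio is undefined when a bin has no long fragments.
        ratio = c["short"] / c["long"] if c["long"] else float("nan")
        stats[key] = {
            "short": c["short"],
            "long": c["long"],
            "ratio": ratio,
            "total": c["short"] + c["long"],
        }
    return stats
```

Each bin yields the four metrics named in the description (short count, long count, short/long ratio, total), so one feature per metric and per bin is produced for every sample.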
This approach generates 26170 features per metric and per sample. The last step is the averaging of the 100 KB bins into larger non-overlapping genomic regions of 5 MB (512 bins in total).

CRC_cegxnps_znorm_training_matrix_v1.tsv
CRC_cegxnps_validation_matrix_v1.tsv

Columns contain sample names; rows contain genomic features. Description of the feature generation process: further detail is given in the manuscript (Walker et al. 2021).

FORESEE_sample_description.tsv

This file holds sample data for the colorectal cancer and control samples described in Walker et al. 2021. The sample_name column links to the column names in the *_matrix.tsv files. The columns denoted raw_file1 and raw_file2 link the sample metadata with the enhancer readcount files contained in gh_readcount_training.tar and gh_readcount_validation.tar. The columns in the table are briefly described below:

sample_name: Sample identifier
Title: Composed of the disease name, gender and sample_name
Source_name: Tissue source
Organism: Contains the term “Homo sapiens”
Characteristics_indication: Disease indication
Characteristics_stage: Cancer stage where appropriate, indicated by Roman numerals (I, II, III, IV)
Characteristics_gender: Described as “Female” or “Male”
Characteristics_ethnicity: Ethnicity description
Characteristics_age_at_collection: Age value in years
Molecule: Contains the value “cell free DNA”
Description: Contains the value “Training sample” or “Validation sample”
Processed_data_file: Contains the term “CRC_enhancer_training_matrix” or “CRC_enhancer_validation_matrix”
raw_file1: Refers to the readcount file from the 5hmC capture library
raw_file2: Refers to the readcount file from the input control (shallow sequenced) library

gh_readcount_training.tar
gh_readcount_validation.tar

These tar files include the raw read counts computed across enhancer regions for case and control data and are referenced in the FORESEE_sample_description.tsv file.
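As a rough illustration of how the raw enhancer read counts (raw_file1 for the 5hmC capture library, raw_file2 for the input library) relate to the normalized enhancer matrices, the sketch below computes RPKM, a log2 enrichment and a z-score. It is a sketch under stated assumptions, not the authors' code: the function names, the pseudocount value of 1.0, the choice to add the pseudocount to the RPKM values rather than to the raw counts, and the use of the population standard deviation are all assumptions. The quantile normalization step is not shown.

```python
import math
from statistics import mean, pstdev


def rpkm(read_count, region_length_bp, total_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_reads)


def log2_enrichment(hmc_count, input_count, region_length_bp,
                    hmc_total_reads, input_total_reads, pseudocount=1.0):
    """log2 ratio of 5hmC-library RPKM to input-library RPKM.

    Assumption: the pseudocount is added to each RPKM value; the
    source does not say where or how large the pseudocount is.
    """
    hmc = rpkm(hmc_count, region_length_bp, hmc_total_reads)
    inp = rpkm(input_count, region_length_bp, input_total_reads)
    return math.log2((hmc + pseudocount) / (inp + pseudocount))


def zscore(values):
    """Scale one feature (one matrix row) across samples to mean 0, sd 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no signal; return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

For one enhancer, calling `log2_enrichment` once per sample and passing the resulting list to `zscore` gives a row comparable in spirit to a row of the znorm matrices, although the exact values will differ from the released files.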
Manuscript Abstract

Our classifier discriminated CRC samples from controls with an area under the receiver operating characteristic curve (AUC) of 90% (sensitivity was 55% at 95% specificity). Performance was similar for early stage 1 (AUC 89%) and late stage 4 CRC (AUC 94%). Performance was independent of the proportion of tumor DNA in the cell free DNA. We expanded the classifier to include information about cell free DNA fragment size and abundance across the genome. Overall performance was similar (AUC 91%), with gains in sensitivity (63% at 95% specificity). The 5-hydroxymethylcytosine signal allows detection of CRC, even in cell free DNA samples with undetectable tumor DNA. Including 5-hydroxymethylcytosine in multi-analyte screening will improve sensitivity for early-stage cancer.
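For readers who want to compute the same kind of metrics on their own classifier scores, the following minimal Python sketch shows one way to obtain an AUC and a sensitivity at a fixed specificity. The function names and the thresholding rule (the score cut-off is taken from the sorted control scores, and samples scoring strictly above it are called positive) are assumptions for illustration; the manuscript may define the operating point differently.

```python
import math


def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as one half (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.95):
    """Fraction of cases called positive when at least `specificity`
    of the controls are called negative."""
    controls = sorted(control_scores)
    # Number of controls that must fall at or below the threshold.
    # The small epsilon guards against floating-point overshoot.
    k = max(1, math.ceil(specificity * len(controls) - 1e-9))
    threshold = controls[k - 1]
    true_positives = sum(score > threshold for score in case_scores)
    return true_positives / len(case_scores)
```

With tied control scores the achieved specificity can exceed the requested value, because every control at or below the threshold is called negative.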

Created:

2022-07-13