Hydroxymethylation profile of cell free DNA is a biomarker for early colorectal cancer
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/records/5170265
Resource description:
The files in this data release represent processed data from the FORESEE study conducted by Cambridge Epigenetix Ltd, and reported in the preprint manuscript: "Hydroxymethylation profile of cell free DNA is a biomarker for early colorectal cancer" (Walker et al. 2021).
As described in the manuscript, classifiers were trained and validated on genomic features extracted from sequencing datasets across cases and controls. Several classes of genomic features were constructed for the training and validation data sets; they are described below:
CRC_enhancer_znorm_training_matrix_v1.csv
CRC_enhancer_znorm_validation_matrix_v1.csv
Columns contain sample names, rows contain genomic features.
Description of feature generation process.
To calculate 5hmC levels at gene enhancers, we first calculated read counts using Bam readcounts v0.01. RPKM values were calculated over candidate gene enhancers downloaded from GeneCards v4.4. 5hmC enrichment was computed as the log2 ratio between the hydroxymethylome library RPKM and the input library RPKM after the inclusion of pseudocounts. The matrices contain feature-scaled (z-score normalized) 5hmC levels of enhancers, quantile-normalized over samples.
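The enrichment calculation above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the pseudocount value of 1.0, the point at which the pseudocount is added, and the choice to z-score each feature across samples are all assumptions, and the quantile-normalization step is omitted.

```python
import math
from statistics import mean, pstdev


def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


def log2_enrichment(capture_rpkm, input_rpkm, pseudocount=1.0):
    """log2 ratio of 5hmC-capture RPKM over input RPKM.

    The pseudocount value (1.0) and adding it to the RPKM values are
    assumptions; the source only says pseudocounts were included.
    """
    return math.log2((capture_rpkm + pseudocount) / (input_rpkm + pseudocount))


def zscore(values):
    """Scale one feature's values (assumed: across samples) to mean 0, SD 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no signal; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy numbers, purely illustrative.
print(rpkm(200, 2000, 10_000_000))   # 10.0
print(log2_enrichment(7.0, 1.0))     # 2.0
print(zscore([1.0, 2.0, 3.0]))
```

The pseudocount keeps the ratio defined for enhancers with zero reads in the input library, which is why it is added before taking the logarithm.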
CRC_cegxdelfi_znorm_training_matrix_v1.tsv
CRC_cegxdelfi_znorm_validation_matrix_v1.tsv
Columns contain sample names, rows contain genomic features.
Description of feature generation process.
We divided the genome into 100 kb bins and quantified cfDNA fragment sizes per bin. Blacklisted regions, genomic gaps (UCSC table) and non-standard chromosomes were removed a priori. We excluded outlier bins in fragment size and retained only fragments between 100 nt and 220 nt in length. For each 100 kb bin (26170 non-overlapping genomic regions in total), we calculated the following characteristics of the fragment size distribution: the number of short fragments (100-150 nt), the number of long fragments (151-220 nt), the ratio between short and long fragments, and the total number of fragments. This approach generates 26170 features per metric and per sample. In the last step, the 100 kb bins were averaged into larger non-overlapping genomic regions of 5 Mb (512 bins in total).
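A rough sketch of the per-bin fragment-size features is given below. It is an illustration under stated assumptions rather than the study's code: fragments are assumed to arrive as (chromosome, start, length) tuples, each fragment is assigned to a bin by its start coordinate, the ratio is taken as short divided by long, a 5 Mb window is taken to be 50 consecutive 100 kb bins, and blacklist, gap and outlier-bin filtering are left out.

```python
from collections import defaultdict

BIN_SIZE = 100_000          # 100 kb bins, as in the description
SHORT_RANGE = (100, 150)    # short fragments, in nt
LONG_RANGE = (151, 220)     # long fragments, in nt


def bin_fragment_features(fragments, bin_size=BIN_SIZE):
    """Count short and long fragments per bin and derive ratio and total.

    `fragments` is an iterable of (chrom, start, length) tuples; this
    input layout is an assumption made for the example.
    """
    counts = defaultdict(lambda: {"short": 0, "long": 0})
    for chrom, start, length in fragments:
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            kind = "short"
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            kind = "long"
        else:
            continue  # outside 100-220 nt: discarded
        counts[(chrom, start // bin_size)][kind] += 1

    features = {}
    for key, c in counts.items():
        ratio = c["short"] / c["long"] if c["long"] else float("nan")
        features[key] = {
            "short": c["short"],
            "long": c["long"],
            "ratio": ratio,
            "total": c["short"] + c["long"],
        }
    return features


def average_into_windows(values_by_bin, bins_per_window=50):
    """Average per-bin values into non-overlapping windows (50 x 100 kb = 5 Mb)."""
    grouped = defaultdict(list)
    for (chrom, bin_index), value in values_by_bin.items():
        grouped[(chrom, bin_index // bins_per_window)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Toy fragments, purely illustrative.
toy = [("chr1", 10, 120), ("chr1", 50_000, 180), ("chr1", 99_999, 140),
       ("chr1", 100_000, 200), ("chr1", 5, 90), ("chr1", 7, 300)]
per_bin = bin_fragment_features(toy)
ratios = {key: f["ratio"] for key, f in per_bin.items()}
print(per_bin[("chr1", 0)])
print(average_into_windows(ratios))
```

Averaging over 5 Mb windows trades resolution for stability: counts in a single 100 kb bin are noisy at shallow sequencing depth.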
CRC_cegxnps_znorm_training_matrix_v1.tsv
CRC_cegxnps_validation_matrix_v1.tsv
Columns contain sample names, rows contain genomic features.
Description of feature generation process.
Further details are given in the manuscript (Walker et al. 2021).
FORESEE_sample_description.tsv
This file holds sample data for colorectal cancer and control samples described in Walker et al. 2021
The sample_name column links to the column names in the *_matrix.tsv files
The columns denoted raw_file1 and raw_file2 link the sample metadata with the enhancer readcount files contained in the gh_readcount_training.tar and gh_readcount_validation.tar.
The columns in the table are briefly described below:
sample_name: Sample identifier
Title: Composed of the disease name, gender and sample_name
Source_name: Tissue source
Organism: Contains the term: “Homo sapiens”
Characteristics_indication: Disease indication
Characteristics_stage: Cancer stage where appropriate, indicated by Roman numerals (I, II, III, IV)
Characteristics_gender: Described as “Female” or “Male”
Characteristics_ethnicity: Ethnicity description
Characteristics_age_at_collection: Age value in years
Molecule: Contains the value “cell free DNA”
Description: Contains value: “Training sample” or “Validation sample”
Processed_data_file: Contains the term: “CRC_enhancer_training_matrix” or “CRC_enhancer_validation_matrix”.
raw_file1: Refers to the readcount file from the 5hmC capture library
raw_file2: Refers to the readcount file from the Input control (shallow sequenced) library
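As a small illustration of how the sample table can be tied to a feature matrix, the sketch below groups samples by the Description column and checks that each sample_name appears among a matrix's column names. The helper names and the toy rows (S1, S2, S3) are hypothetical; only the column names sample_name and Description come from the list above.

```python
import csv
import io


def read_tsv_rows(handle):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def split_by_description(rows):
    """Group sample names by their Description value."""
    groups = {}
    for row in rows:
        groups.setdefault(row["Description"], []).append(row["sample_name"])
    return groups


def missing_from_matrix(sample_names, matrix_header):
    """Return the sample names that have no matching column in the matrix."""
    columns = set(matrix_header)
    return [name for name in sample_names if name not in columns]


# Toy stand-ins for FORESEE_sample_description.tsv and a matrix header row.
toy_metadata = (
    "sample_name\tDescription\n"
    "S1\tTraining sample\n"
    "S2\tValidation sample\n"
    "S3\tTraining sample\n"
)
toy_matrix_header = ["feature", "S1", "S3"]

rows = read_tsv_rows(io.StringIO(toy_metadata))
groups = split_by_description(rows)
print(groups["Training sample"])
print(missing_from_matrix(groups["Training sample"], toy_matrix_header))
```

With the real files, `io.StringIO(...)` would be replaced by an open file handle, and the matrix header by the first row of the corresponding matrix file.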
gh_readcount_training.tar
gh_readcount_validation.tar
These tar files include the raw read counts computed across enhancer regions for case and control data and are referenced in the FORESEE_sample_description.tsv file.
Manuscript Abstract
Our classifier discriminated CRC samples from controls with an area under the receiver operating characteristic curve (AUC) of 90% (sensitivity was 55% at 95% specificity). Performance was similar for early stage 1 (AUC 89%) and late stage 4 CRC (AUC 94%). Performance was independent of the proportion of tumor-DNA in the cell free DNA.
We expanded the classifier to include information about cell free DNA fragment size and abundance across the genome. Overall performance was similar (AUC 91%), with gains in sensitivity (63% at 95% specificity).
The 5-hydroxymethylcytosine signal allows detection of CRC, even in cell free DNA samples with undetectable tumor DNA. Including 5-hydroxymethylcytosine in multi-analyte screening will improve sensitivity for early-stage cancer.
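For readers less familiar with the metrics quoted in the abstract, the sketch below shows one common way to compute an AUC and a sensitivity at a fixed specificity from classifier scores. It is a generic illustration with made-up scores, not the evaluation code of the study; the function names and the thresholding convention (a sample is called positive when its score is strictly above the threshold) are assumptions.

```python
import math


def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.95):
    """Fraction of cases above the threshold that keeps the requested
    fraction of controls at or below it."""
    if not 0 < specificity <= 1:
        raise ValueError("specificity must be in (0, 1]")
    ranked = sorted(control_scores)
    n_negative_calls = math.ceil(specificity * len(ranked))
    threshold = ranked[n_negative_calls - 1]
    detected = sum(1 for score in case_scores if score > threshold)
    return detected / len(case_scores)


# Made-up scores, purely illustrative.
cases = [0.9, 0.8, 0.4, 0.3]
controls = [0.1, 0.2, 0.35, 0.7]
print(auc(cases, controls))                                # 0.8125
print(sensitivity_at_specificity(cases, controls, 0.75))   # 0.75
```

Fixing the specificity first and then reading off the sensitivity mirrors how the abstract reports its figures, for example sensitivity at 95% specificity.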
Created:
2022-07-13



