CUPiD, A cfDNA methylation-based tissue-of-origin classifier for Cancers of Unknown Primary - classifier data and code
Mendeley Data. Updated 2024-05-10; indexed 2024-06-27.
Download link:
https://zenodo.org/records/10678015
Description:
This repository holds the code behind the article "A cfDNA methylation-based tissue-of-origin classifier for Cancers of Unknown Primary" by Conway, Pearce, Clipson et al., published in Nature Communications. It contains the code and data required to generate the CUPiD classifier itself. Data and code to reproduce the figures in the paper are available from https://zenodo.org/uploads/10684337 (unrestricted).

Methyl-Binding Domain protein sequencing (MBD-Seq) was applied to circulating cell-free DNA (cfDNA) samples from 143 patients with a range of known cancer types, and from 106 non-cancerous controls (79 of which were used in training). The objects deposited here include R data files containing qseaSets from the R package qsea. These hold the read counts per sample per 300 base pair window across the genome, along with copy number variation information and metadata tables. They are provided in the inputFiles/nextflowOutput folder and are some of the outputs of the Nextflow pipeline.

The scripts folder contains numbered sub-folders, each holding numbered scripts, which should be run in order. The scripts are set up to run on a PBS-Torque system: files ending ".pbs" should be submitted via qsub, while files ending ".sh" should be run on a node and will submit individual jobs within a loop. R scripts without an associated .pbs or .sh file should be run directly. All files should be submitted from the base of the repository (e.g. qsub scripts/01-downloadData/01-getRawData.pbs) so that the paths are set correctly via the environment variable PBS_O_WORKDIR.

01-downloadData contains scripts to download and preprocess all the required data. 02-qseaSetNextFlowPipeline contains our custom in-house DSL2 Nextflow pipeline, which takes fastq files through to processed qseaSets, including QC checks. This step requires the fastq files, which will be deposited in EGA.
03-convertArrays converts the downloaded (pre-processed) arrays into estimated qseaSets (each containing only the array sample), and then mixes each array with each non-cancerous control (NCC) cfDNA sample at varying proportions. 04-DMRs calculates pairwise differentially methylated regions (DMRs) between classes. 05-prepForClassifier selects up to 10000 mixture sets per class and generates a large table suitable for input into the ML model. 06-fitClassifier fits the ML model using xgboost within the tidymodels framework. This is repeated 100 times with different subsets of the mixture sets as input. 07-applyClassifier applies these classifiers to the "independent test cohort": the 143 patients with known tumour types, 27 additional NCCs and the 41 patients with CUP. These samples were run through the Nextflow pipeline separately from the 79 NCCs used to derive CUPiD, and were not used to derive the classifier. 08-UMAPs generates some UMAPs on the array data.

A subset of these output files is provided in https://zenodo.org/uploads/10684337 , along with the code to reproduce the figures.
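
The mixing step described for 03-convertArrays can be pictured with a short sketch. It is written in Python rather than the repository's R, and it assumes a simple linear blend of per-window signal, which may not match how the repository's code combines qseaSets. The helper names `mix_profiles` and `build_mixture_sets`, the sample labels, the window values and the proportions are all hypothetical and chosen only for illustration.

```python
from itertools import product


def mix_profiles(tumour, ncc, proportion):
    """Blend two per-window signal vectors (hypothetical helper).

    Assumption: a linear mixture, proportion * tumour + (1 - proportion) * ncc.
    """
    if len(tumour) != len(ncc):
        raise ValueError("profiles must cover the same windows")
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return [proportion * t + (1.0 - proportion) * n for t, n in zip(tumour, ncc)]


def build_mixture_sets(arrays, nccs, proportions):
    """Mix every array profile with every NCC profile at every proportion."""
    mixtures = {}
    for (array_name, array_profile), (ncc_name, ncc_profile), p in product(
        arrays.items(), nccs.items(), proportions
    ):
        mixtures[(array_name, ncc_name, p)] = mix_profiles(array_profile, ncc_profile, p)
    return mixtures


# Toy data: three genomic windows per sample (all values are invented).
arrays = {"array_A": [10.0, 0.0, 4.0], "array_B": [2.0, 8.0, 6.0]}
nccs = {"ncc_1": [0.0, 2.0, 4.0], "ncc_2": [4.0, 4.0, 0.0]}
proportions = [0.01, 0.05, 0.5]

mixture_sets = build_mixture_sets(arrays, nccs, proportions)
```

With 2 arrays, 2 NCC samples and 3 proportions, this yields 12 mixture sets, one per (array, NCC, proportion) combination. The number of mixtures grows multiplicatively in this way, which is consistent with the description's later step of capping the selection at 10000 mixture sets per class.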
Created:
2024-02-23



