The Prostate, Lung, Colon, Ovary Screening Trial (PLCO)

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001286.v4.p2
Description:
The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial is a large population-based randomized trial designed and sponsored by the National Cancer Institute (NCI) to determine the effects of screening on cancer-related mortality and secondary endpoints in over 150,000 men and women aged 55 to 74. The screening component of the trial was completed in 2006, and participants have remained under follow-up for cancer incidence and mortality since then. In addition, PLCO included a large biorepository of biological samples, which has served as a unique resource for cancer research, particularly for etiologic and early-marker studies. As part of these efforts, PLCO has been used for a large number of genome-wide association and exome sequencing studies of different types of cancer. Recently, a blood DNA methylation analysis was conducted in a nested case-control study of breast cancer.
Here, we are posting a harmonized and imputed dataset of PLCO GWAS and exome data, consisting of all harmonizable PLCO genotype data from each completed scan of cancer cases and controls, as well as the key covariates of sex and participant ID. Because PLCO is a prospective cohort, incident cancers and other diseases continue to occur. It is therefore important that researchers use contemporary follow-up to define cancer case/control status precisely. To use these data, researchers should obtain the genetic data from dbGaP and, in parallel, obtain up-to-date data on cancer and other diseases through the PLCO Cancer Data Access System (CDAS): http://prevention.cancer.gov/major-programs/prostate-lung-colorectal/cancer-data-access-system. Also available in CDAS are a wide variety of covariates and endpoints, as well as published biomarker data, which can be used for both main-effect and gene x environment studies. Together, we believe that these data will serve as a helpful resource for the entire scientific community.
This PLCO dataset contains data genotyped on Illumina GSA and Oncoarray, plus historical data genotyped on Illumina OmniExpress (OmniX), Omni2.5M (Omni25) and Omni5M (Omni5). Most of the platforms used in PLCO were run separately, and processed and QCed at different times. GSA data were generated at CGR within a relatively short period. Oncoarray data were genotyped at CGR and at multiple external institutes. OmniX, Omni25 and Omni5 data were genotyped at CGR historically. Genotype data from OmniX and Omni25 were generated with different clustering files. All genotype data were prepared in the binary PLINK file format. All released data should be in GRCh37/hg19. Chip data generated within CGR have had internal QC measures (iterative 80% and 95% sample- and variant-level call rate filters) applied, but not more stringent pre-imputation MAF and HWE filtering; external data have inconsistent QC due to provenance. Samples present in multiple genotyping datasets are released in all applicable datasets with the same synchronized PLCO ID. All subjects were split and cleaned by GRAF ancestry (see below) before imputation. More specifically, imputed data from each platform was split into 7 ancestral groups (African+African American, East Asian+Other Asian, European, Hispanic1, Hispanic2, Other, South Asian) based on ancestry assignment using GRAF (https://github.com/ncbi/graf). The TOPMed reference panel 5b was used for imputation with the Michigan Imputation Server (https://imputationserver.sph.umich.edu). Pre-phasing using phased reference data from TOPMed release 5b was conducted using EAGLE 2.4 (doi: 10.1038/ng.3679). Imputation was conducted against the same reference panel using minimac4 (https://genome.sph.umich.edu/wiki/Minimac4). Because the Michigan Imputation Server limits the sample size per job, the GSA/European dataset was split into 4 batches for imputation.
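The iterative call rate filtering mentioned above can be illustrated with a small sketch. This is not CGR's pipeline: the order of operations (variants first, then samples), the "repeat until nothing changes" stopping rule, the in-memory genotype matrix, and the function names `call_rate_filter` and `two_pass_qc` are all assumptions made for illustration. Real data in binary PLINK format would normally be filtered with dedicated tools.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # allele count 0/1/2, or None for a missing call


def call_rate_filter(
    matrix: List[List[Genotype]], threshold: float
) -> Tuple[List[int], List[int]]:
    """Drop variants, then samples, below `threshold` call rate until stable.

    matrix[s][v] is the genotype of sample s at variant v.
    Returns (kept sample indices, kept variant indices).
    """
    samples = list(range(len(matrix)))
    variants = list(range(len(matrix[0]))) if matrix else []
    while True:
        # Variant-level call rate, computed over the samples still retained.
        kept_variants = [
            v for v in variants
            if samples
            and sum(matrix[s][v] is not None for s in samples) / len(samples)
            >= threshold
        ]
        # Sample-level call rate, computed over the surviving variants.
        kept_samples = [
            s for s in samples
            if kept_variants
            and sum(matrix[s][v] is not None for v in kept_variants)
            / len(kept_variants)
            >= threshold
        ]
        if kept_variants == variants and kept_samples == samples:
            return samples, variants
        samples, variants = kept_samples, kept_variants


def two_pass_qc(matrix: List[List[Genotype]]) -> Tuple[List[int], List[int]]:
    """Assumed ordering: a lenient 80% pass followed by a 95% pass."""
    samples, variants = call_rate_filter(matrix, 0.80)
    sub = [[matrix[s][v] for v in variants] for s in samples]
    s2, v2 = call_rate_filter(sub, 0.95)
    # Map indices of the sub-matrix back to indices of the original matrix.
    return [samples[i] for i in s2], [variants[j] for j in v2]
```

Running the lenient pass first means that a few very poor samples or variants do not drag otherwise good ones below the stricter threshold.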
Each platform/ancestry pair was cleaned according to the filtering method in https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008500. Briefly, all variants with Rsq < 0.3 are removed, to be consistent with traditional quality filters on MACH-style output. The remaining variants are then partitioned into minor allele frequency (MAF) bins {[0,0.0005], (0.0005,0.002], (0.002,0.005], (0.005,0.01], (0.01,0.03], (0.03,0.05], (0.05,0.5]}. Within each bin, variants are filtered out, starting at the lowest Rsq, until the average Rsq of the remaining variants in that bin is at least 0.9 (the Kowalski et al. citation suggests 0.8; the use of a more stringent threshold has no impact on common variation).
For the nested case-control methylation analysis, 1680 samples were run on the MethylationEPIC.v1 BeadChip array, including 806 breast cancer cases and 825 controls frequency matched on age at random assignment (5-year intervals) and fiscal year of randomization (pre/post 10/1/1997). All controls were alive and had no history of cancer as of the date of diagnosis of the matched case. The data also include 18 internal controls; the remaining samples are duplicates for quality control.
Much of the data from these studies has been published and posted to dbGaP in the past. However, such data were posted as part of published projects in which PLCO data were nearly always intermixed with data from other studies. Furthermore, the PLCO data from each of these studies were genotyped across a range of Illumina genotyping platforms. Thus, prior to this posting, it has been impossible to use all of the data from PLCO as one cohesive unit.
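The MAF-binned Rsq filter described earlier in this section can be sketched as follows. This is a minimal illustration of the stated rule, not the code used for the release: the `(variant ID, MAF, Rsq)` tuple layout and the function name `filter_by_binned_rsq` are hypothetical, and MAF values are assumed to lie in [0, 0.5].

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

# Upper edges of the right-closed MAF bins: [0,0.0005], (0.0005,0.002], ..., (0.05,0.5]
MAF_BIN_UPPER = [0.0005, 0.002, 0.005, 0.01, 0.03, 0.05, 0.5]

Variant = Tuple[str, float, float]  # (variant ID, MAF, Rsq); hypothetical layout


def filter_by_binned_rsq(
    variants: List[Variant], min_rsq: float = 0.3, target_mean: float = 0.9
) -> List[Variant]:
    """Drop Rsq < min_rsq, then trim each MAF bin until its mean Rsq >= target_mean."""
    bins: Dict[int, List[Variant]] = {}
    for var in variants:
        _, maf, rsq = var
        if rsq < min_rsq:
            continue  # hard floor applied before binning
        # bisect_left on the upper edges gives right-closed bins.
        bins.setdefault(bisect_left(MAF_BIN_UPPER, maf), []).append(var)

    kept: List[Variant] = []
    for idx in sorted(bins):
        members = sorted(bins[idx], key=lambda v: v[2])  # ascending Rsq
        total = sum(v[2] for v in members)
        start = 0
        # Remove the lowest-Rsq variant while the bin mean is below target.
        while start < len(members) and total / (len(members) - start) < target_mean:
            total -= members[start][2]
            start += 1
        kept.extend(members[start:])
    return kept
```

Because variants are removed in order of increasing Rsq, a bin can be emptied entirely if even its best variant falls below the target mean.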
Created:
2023-10-23