WHI Harmonized and Imputed GWAS Data
Source: DataCite Commons | Updated: 2026-04-09 | Indexed: 2026-05-04
Download link:
https://gen3.biodatacatalyst.nhlbi.nih.gov/discovery/phs000746.v3.p3.c2/
Description:
This substudy, phs000746 WHI Imputation, is a joint imputation and harmonization effort covering many of the Genome-Wide Association Studies (GWAS) within the Women's Health Initiative (WHI) Clinical Trials and Observational Studies. Summary-level phenotypes for the WHI Cohort study participants can be viewed at the top-level study page [phs000200](./study.cgi?study_id=phs000200) WHI Cohort. Individual-level phenotype data and molecular data for the WHI top-level study and all substudies are available by requesting Authorized Access to the WHI Cohort study [phs000200](./study.cgi?study_id=phs000200).
These studies jointly involved over 30,000 samples. The effort comprised alignment ("flipping") to the same reference panel, imputation to the 1000 Genomes Project (1KGP) reference panel, identification of genetically related individuals, and computation of principal components with comparison against self-reported ethnicity.
The harmonization/imputation effort involves 6 GWAS, as described in the table below.

| Study (WHI Study #) | Hip Fracture GWAS (BA3) | [SHARe](./study.cgi?study_id=phs000386) (M5) | [GARNET](./study.cgi?study_id=phs000315) (M13) | [WHIMS+](./study.cgi?study_id=phs000675) (W63) | GECCO (AS224) | MOPMAP (AS264) |
| --- | --- | --- | --- | --- | --- | --- |
| Directly genotyped GWAS data on dbGaP | No | Yes | Yes | Data uploaded to dbGaP June 2013 | Yes | No |
| Study that funded GWAS | WHI-BAA3 National Heart, Lung, and Blood Institute (NHLBI) | NHLBI | WHI-GARNET - National Human Genome Research Institute (NHGRI) | NHLBI | WHI-AS224 - National Cancer Institute (NCI) | WHI-AS264 - National Institute of Environmental Health Sciences (NIEHS) and Univ of North Carolina |
| GWAS platform | Illumina 550K and 610K | Affymetrix 6.0 | Illumina HumanOmni1-Quad v1-0 B | HumanOmniExpress Exome-8v1_B | Illumina 610 and Cytochip 370K | Affymetrix Gene Titan, Axiom Genome-Wide Human CEU I Array Plate |
| Design | Case-control | Cohort | Case-control (4 case groups) | Cohort | Case-control | Case-control |
| Phenotype for cases | Hip Fracture | NA | Type 2 Diabetes, Myocardial Infarction, Stroke, Venous Thrombosis | NA | Colorectal cancer | Ventricular Ectopy (ever) |
| Other sample details | NA | Minorities | Hormone Therapy Clinical Trial | Hormone Therapy Clinical Trial | NA | Controls selected within centers, years, seasons and visit years in which cases originated |
| Ethnicity | Mostly white | African American and Hispanic | Mostly white | White | White | White |
| Sample size* | 3690 | 11992 | 4883 | 5687 | 2493 | 3069 |

*The sample sizes are the number of samples after QC that are available on dbGaP. Note that some subjects are in multiple studies, as detailed below.

#For some of the data files these two platforms are considered different studies.
**Initial QC**
Initial QC had already been carried out on each of the GWAS studies, using the GENEVA protocol or protocols that were very similar. Some of the pertinent QC parameters used for each study are shown in the table below.

| QC parameter | Hip Fracture GWAS | SHARe | GARNET | WHIMS+ | GECCO | MOPMAP |
| --- | --- | --- | --- | --- | --- | --- |
| Minimal sample call rate | 98% | 95% | 98% | 97% | 97% | 95% |
| Minimal SNP call rate | 98% | 90% | 98% | 98% | 98% | 90% |
| Hardy Weinberg P-value cut-off below which SNPs are excluded | 1e-4 | 1e-6 | 1e-4 | 1e-4 | 1e-4 | 1e-6 |
| Samples used for Hardy Weinberg calculations | Controls of European-ancestry | All samples, separate for Hispanics and African Americans | Unrelated controls of European-ancestry | All | Controls | All |
| Minimum allele frequency cut-off | 1% | 1% | None | 1% | 5% | 0.5% |

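
As an illustration of how the SNP-level thresholds in the table could be applied, here is a minimal Python sketch. The threshold values are copied from the table; the class name `QCThresholds`, the function `snp_passes_qc`, and the treatment of values exactly at a cut-off (kept) are assumptions made for this example and are not part of the original QC pipelines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QCThresholds:
    min_sample_call_rate: float
    min_snp_call_rate: float
    hwe_p_cutoff: float          # SNPs with HWE p-value below this are excluded
    min_maf: Optional[float]     # None means no allele frequency filter


# Values transcribed from the QC table above.
QC = {
    "Hip Fracture GWAS": QCThresholds(0.98, 0.98, 1e-4, 0.01),
    "SHARe": QCThresholds(0.95, 0.90, 1e-6, 0.01),
    "GARNET": QCThresholds(0.98, 0.98, 1e-4, None),
    "WHIMS+": QCThresholds(0.97, 0.98, 1e-4, 0.01),
    "GECCO": QCThresholds(0.97, 0.98, 1e-4, 0.05),
    "MOPMAP": QCThresholds(0.95, 0.90, 1e-6, 0.005),
}


def snp_passes_qc(study: str, call_rate: float, hwe_p: float, maf: float) -> bool:
    """Apply the SNP-level filters of one study (hypothetical helper)."""
    t = QC[study]
    if call_rate < t.min_snp_call_rate:
        return False
    if hwe_p < t.hwe_p_cutoff:
        return False
    if t.min_maf is not None and maf < t.min_maf:
        return False
    return True
```

The same SNP can pass in one study and fail in another, for example a SNP with a 92% call rate passes the SHARe threshold (90%) but not the GECCO threshold (98%).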
**Imputation**
The imputation was done using the following procedures:

- The strand of the GWAS data was matched to the 1KGP data by comparing the allele letters; strand-ambiguous A/T and C/G SNPs were excluded.
- We used the 1KGP reference panel (1092 samples; v2.20101123 for GECCO; v3.20101123 for GARNET, HIPFX (the Hip Fracture GWAS), MOPMAP, WHIMS+).
- The GWAS data were first split into chunks of 10000 SNPs each, with neighboring chunks sharing 1000 overlapping SNPs. All chunks were then phased using BEAGLE and combined using mergebeaglechunks.jar (available from the BEAGLE website).
- An autoclip file was created for minimac to specify the range of each chunk (start and stop) and the SNPs to be imputed within the chunk (core_start and core_end), so that no SNP is imputed twice. All chunks were imputed to the 1KGP panel using minimac.
- SNPs that could not be imputed with sufficient confidence in a particular study (those failing the R2 > 0.1 cut-off) were omitted for that study. They still appear as columns of missing data in that study's files if they were kept in other studies, to facilitate alignment.
- We did not impute the X chromosome.

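
The strand-matching and chunking steps above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, indices are 0-based and half-open, and splitting each overlap at its midpoint to define the core ranges is an assumption (the source states only that cores were chosen so that no SNP is imputed twice). It does not reproduce the actual autoclip file format used by minimac.

```python
from typing import List, NamedTuple, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """A/T and C/G SNPs read the same on both strands."""
    return COMPLEMENT.get(a1) == a2


def strand_action(gwas: Tuple[str, str], ref: Tuple[str, str]) -> str:
    """Return 'keep', 'flip', or 'drop' for one SNP by comparing allele letters."""
    if any(a not in COMPLEMENT for a in gwas + ref):
        return "drop"  # non-ACGT alleles are out of scope for this sketch
    if is_strand_ambiguous(*gwas):
        return "drop"
    if set(gwas) == set(ref):
        return "keep"
    if {COMPLEMENT[a] for a in gwas} == set(ref):
        return "flip"
    return "drop"  # alleles match the reference on neither strand


class Chunk(NamedTuple):
    start: int       # first SNP index in the chunk
    end: int         # one past the last SNP index in the chunk
    core_start: int  # first SNP index imputed from this chunk
    core_end: int    # one past the last SNP index imputed from this chunk


def make_chunks(n_snps: int, chunk_size: int = 10_000, overlap: int = 1_000) -> List[Chunk]:
    """Split n_snps into overlapping chunks with non-overlapping core ranges."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if n_snps <= 0:
        return []
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_size, n_snps)
        bounds.append((start, end))
        if end == n_snps:
            break
        start += chunk_size - overlap
    chunks = []
    for i, (start, end) in enumerate(bounds):
        # Assumption: each overlap is split at its midpoint between neighbors.
        core_start = start if i == 0 else start + overlap // 2
        core_end = end if i == len(bounds) - 1 else end - (overlap - overlap // 2)
        chunks.append(Chunk(start, end, core_start, core_end))
    return chunks
```

With the default sizes, consecutive chunks start 9000 SNPs apart, and the core ranges tile the SNP list without gaps or repeats.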
The SHARe study was independently imputed to the same reference panel. The procedures used were similar to those listed above, except that MACH was used to carry out the imputation.
**Harmonization**

- A panel of 5665 SNPs was used to check pairwise concordance among all samples in GARNET, GECCO, HIPFX, SHARe, WHIMS+ and MOPMAP.
- The same panel was used, together with HapMap samples, in a principal component check to identify ethnicity outliers.
- The same panel was used to check IBD in PLINK to identify relatedness among samples.
- Another PC analysis was done on the combined samples from all studies (after removal of ineligible duplicates); the resulting PCs were then mapped back to the samples within each study.
- A NetCDF file of imputed results was created for each chromosome in each study. All studies have the same set of SNPs; SNPs that were not successfully imputed in a particular study but are present in other studies are stored as missing values.
- A SNP info file was also created along with each NetCDF file, describing the SNP name, chromosome, position, count allele, alternative allele, count allele frequency, and imputation quality for each SNP.

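
To make the shared-SNP-set convention concrete, here is a small Python sketch of aligning per-study imputation results onto one common SNP list, with missing values where a study did not pass the imputation quality cut-off. The in-memory layout (a dict from SNP name to an `(r2, dosages)` pair), the function name `harmonize`, and the ordering by SNP name are assumptions for illustration; the actual files are NetCDF and their variable names are not described here.

```python
from typing import Dict, List, Optional, Tuple

R2_CUTOFF = 0.1  # SNPs are kept in a study only if R2 > 0.1

# Hypothetical layout: snp name -> (imputation R2, per-sample dosages)
Imputed = Dict[str, Tuple[float, List[float]]]


def harmonize(
    studies: Dict[str, Imputed],
) -> Tuple[List[str], Dict[str, List[Optional[List[float]]]]]:
    """Align all studies to the union of SNPs kept in at least one study."""
    kept = {
        name: {snp for snp, (r2, _) in data.items() if r2 > R2_CUTOFF}
        for name, data in studies.items()
    }
    # Sorted by name for determinism; real files would follow genomic position.
    all_snps = sorted(set().union(*kept.values()))
    aligned = {
        name: [studies[name][snp][1] if snp in kept[name] else None for snp in all_snps]
        for name in studies
    }
    return all_snps, aligned
```

A SNP that fails the cut-off in every study is dropped entirely, while a SNP that passes in only one study appears in all studies and is `None` everywhere else.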
**Duplicates**
Because subjects for each of these GWAS were selected independently, we checked for duplicates between the studies. We removed a small number of samples that:

- were supposed to be duplicates but had a concordance rate smaller than 90%; and
- appeared to be duplicates but came from unrelated individuals who appeared not to be monozygotic twins.

We kept in our data sets samples that:

- were monozygotic twins (see Relatedness below), and
- were duplicates between studies.

There are currently 29846 unique subjects in the data.
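
The 90% concordance criterion for expected duplicates can be illustrated with a short Python sketch. Genotypes are assumed to be coded as allele counts (0, 1, 2) with `None` for missing calls, and concordance is assumed to be the fraction of identical calls among SNPs called in both samples; the function names are hypothetical and the exact definition used by the study is not stated in the source.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1, or 2 copies of the count allele; None if missing


def concordance(g1: Sequence[Genotype], g2: Sequence[Genotype]) -> Optional[float]:
    """Fraction of matching calls among SNPs that are non-missing in both samples."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        return None  # no overlapping calls, concordance undefined
    return sum(a == b for a, b in pairs) / len(pairs)


def failed_expected_duplicate(
    g1: Sequence[Genotype], g2: Sequence[Genotype], threshold: float = 0.90
) -> bool:
    """True if a pair that should be duplicates has concordance below the threshold."""
    rate = concordance(g1, g2)
    return rate is not None and rate < threshold
```

In practice the comparison would run over the 5665-SNP panel mentioned under Harmonization rather than over short lists.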
**Relatedness**
We carried out an IBD analysis using a subset of 5665 SNPs and the PLINK package. We used the results to identify 42 parent-offspring pairs and 303 pairs of siblings/first-degree relatives. These, together with the 5 pairs of monozygotic twins, are listed in the file *"WHI_GWAS_relatedness_information.csv"*, which lists all pairs of related individuals. We did not identify second- and higher-degree relatives (e.g. cousins, half-sibs, etc.).
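
One way to turn pairwise IBD estimates into relationship labels is to compare each pair's estimated IBD-sharing probabilities (Z0, Z1, Z2, as reported by PLINK) with the values expected for each relationship. The Python sketch below assigns the nearest expected profile. The source does not state which decision rule or thresholds were used, so this nearest-profile rule and the function names are assumptions for illustration only.

```python
import math
from typing import Tuple

# Expected probabilities of sharing 0, 1, or 2 alleles identical by descent.
EXPECTED = {
    "monozygotic_twin_or_duplicate": (0.0, 0.0, 1.0),
    "parent_offspring": (0.0, 1.0, 0.0),
    "full_sibling": (0.25, 0.5, 0.25),
    "second_degree": (0.5, 0.5, 0.0),  # included so such pairs are not mislabeled
    "unrelated": (1.0, 0.0, 0.0),
}


def pi_hat(z1: float, z2: float) -> float:
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def classify_pair(z0: float, z1: float, z2: float) -> str:
    """Label a pair by the closest expected (Z0, Z1, Z2) profile (assumed rule)."""
    observed: Tuple[float, float, float] = (z0, z1, z2)
    return min(EXPECTED, key=lambda label: math.dist(EXPECTED[label], observed))
```

Parent-offspring pairs and full siblings both have an expected proportion shared of 0.5, which is why the full (Z0, Z1, Z2) profile is needed to separate them rather than the single summary value.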
**Principal Component Analysis**
We carried out a principal component analysis in R using a subset of 5665 SNPs. The PCs are available in the subject information file (see below). In this file we identified subjects whose genetic ethnicity was inconsistent with the self-reported ethnicity on WHI Form 2. For example, a subject who self-identified as Asian but clustered completely with the white (European American) subjects would be labeled as genetically inconsistent. If this subject were instead located halfway between the white and Asian clusters in a PC plot, she would not be labeled as inconsistent: Form 2 allowed only one ethnicity, so the subject could have been both Asian and white and chosen to identify herself as Asian. Note that WHI collected additional ethnicity information on Form 41. This information was not used when interpreting the PCs. Users are encouraged to compare with the information on this form, as in some instances it will help interpret data on admixed subjects or on subjects whose PCs and Form 2 information are inconsistent.
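
The labeling logic described above (flag a subject who clusters completely with another group, but not one who lies between two groups) could be approximated as in the following Python sketch. The source does not describe a formal rule, so the centroid-distance criterion, the `ratio` value of 0.25, and the function names are all assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Point = Tuple[float, ...]


def group_centroids(pcs: Sequence[Point], labels: Sequence[str]) -> Dict[str, Point]:
    """Mean PC coordinates of each self-reported group."""
    grouped = defaultdict(list)
    for point, label in zip(pcs, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(coord) / len(points) for coord in zip(*points))
        for label, points in grouped.items()
    }


def is_inconsistent(
    point: Point, label: str, centroids: Dict[str, Point], ratio: float = 0.25
) -> bool:
    """Flag only if the subject is far closer to another group than to her own.

    A subject halfway between two groups has similar distances to both
    centroids and is therefore not flagged.
    """
    d_own = math.dist(point, centroids[label])
    d_other = min(
        (math.dist(point, c) for group, c in centroids.items() if group != label),
        default=math.inf,
    )
    return d_other < ratio * d_own
```

Because the centroids here are computed from self-reported labels, a few mislabeled subjects shift them slightly; with large groups this effect is small, but it is one reason to treat the flag as a prompt for review rather than a final call.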
Provider:
NHLBI BioData Catalyst
Created:
2026-02-06



