Genome-wide study of resistance to severe malaria in eleven populations (version 1)

NIAID Data Ecosystem, indexed 2026-03-09
Download link:
https://www.malariagen.net/data_package/genome-wide-study-resistance-severe-malaria-eleven-populations-version-1/
Resource description:
Background

This data release contains SNP genotype data and association test results from our ongoing analysis of severe malaria in eleven populations. Data for three populations (Gambia, Malawi and Kenya) are currently available; additional populations will be added as they become available.

If you use these data, please cite: Malaria Genomic Epidemiology Network. A novel locus of resistance to severe malaria in a region of ancient balancing selection. Nature. 2015 Oct 8;526(7572):253-7. doi: 10.1038/nature15390.

This release contains two types of data:

SNP genotype data. These data reflect genotyping of all samples on the Illumina Omni 2.5M array and are provided in VCF format. Additionally, we provide the clinical status, gender, and sickle trait status of each sample, and information on quality control. Full details are provided below.

Association test summary statistics. Identifiers, allele frequencies, imputation status and meta-analysis results for directly typed SNPs and variants imputed from the 1000 Genomes Project Phase 1 reference panel. For full details see the association test summary statistics README file (95.6 KB).

These data have been deposited in the European Genome-phenome Archive under EGA Study ID: EGAS00001001311.

All cases were diagnosed as meeting the WHO definition of severe malaria (see References [1-3]). Controls were samples from within the general population and from new births. Samples in these datasets are nominally unrelated, with the exception of a small number of familial relationships detailed in the relevant .relationships.txt file (described below).

The information provided here is common to each of the three population-specific datasets. For the association test summary statistics please see the separate README file (95.6 KB).
Data set structure

Each data set contains a README file and a set of three data files:

Samples
Genotypes
Quality Control (QC) information

README files

EGAS00001001311_Kenya_GWAS-2.5M_b37_releasenote.txt
EGAS00001001311_Gambia_GWAS-2.5M_b37_releasenote.txt
EGAS00001001311_Malawi_GWAS-2.5M_b37_releasenote.txt

Samples

Each data set includes three sample-related files:

A sample file
A sample metadata file
A file with information about any familial relationships

Sample files:

samples/Kenya_GWAS-2.5M_b37.sample
samples/Gambia_GWAS-2.5M_b37.sample
samples/Malawi_GWAS-2.5M_b37.sample

These are space-delimited files in a format suitable for use with the program SNPTEST, and contain information on the samples included in this study. Samples are identified both by a sample identifier and a chip assay identifier. Note that in some cases the same sample was genotyped multiple times, giving multiple chip IDs.

The first row of this file gives column names. Columns are described below and in the file sample_metadata.csv. The second row of this file contains information on the type of values stored in each column, as follows:

0 - an identifier field
D - a discrete or categorical field
B - a binary (case/control) phenotype
C - a continuous or numerical covariate

Note that for some tools it may be necessary to rename the first two columns of this file as 'ID_1' and 'ID_2'.

Columns in this file are as follows:

chip_id - identifier for the chip assay

sample_id - identifier for the DNA sample

missing - not used in this dataset

dataset - the name of the dataset

plate - the id of the 96-well plate on which the sample was supplied for genotyping

well - the well on the 96-well plate on which the sample was supplied for genotyping

status - either 'CASE' (for severe malaria cases) or 'CONTROL' (for population controls). Note that some samples are assigned neither case nor control status. These are samples collected as parents of affected children (reported as 'PARENT') or samples with other designation (not reported here). Where applicable, family structure is described in the file Gambia_GWAS-2.5M_b37.relationships.txt (described below). There are also 3 HapMap samples (NA12878, NA12891, NA12892).

severe_malaria - a binary (0/1) indicator of case/control status based on the status column above. We include this to simplify association testing.

clinical_sex - gender as reported on sample collection. M = male, F = female, NA = missing or unknown gender.

estimated_sex - gender as determined by comparison of assay intensities on the X and Y chromosomes. This is only provided for samples that passed QC thresholds.

ethnicity - reported ethnic group. Where maternal and paternal ethnic groups differ, this is reported in the format '_MIXED'. Ethnic information is provided only for the major ethnic groups (those comprising at least 5% of our sample). All other groups have been pooled together and labelled as "OTHER".

rs334_genotype - assayed HbS (rs334) genotype for each individual as typed on the Sequenom iPLEX platform. See the URLs below for links to further details on this SNP. The genotype data for rs334 are provided with respect to the forward strand of the human reference sequence (T: major allele/ancestral allele/reference allele and A: minor allele/alternative allele/non-reference allele). Note that although this SNP is reported as multi-allelic in dbSNP, we have assayed only the segregating T and A alleles. The genome position with respect to GRCh37 is 11:5204808. Where we were unable to determine a genotype, the data are represented by NA.

PC1 to PC10 - the first 10 principal components used in [1] to control for population structure in genome-wide association analysis (GWAS). Missing values are set to NA; samples with missing values are those that were excluded from GWAS analyses in [1], and these samples also appear in the exclusion lists.
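The sample file layout described above (a row of column names, a row of type codes, then one row per chip assay) can be read with a few lines of Python. The following is a minimal sketch rather than an official reader: the function name `read_sample_file` and the miniature example data are invented for illustration, only a subset of the documented columns is shown, and it assumes that no field contains an embedded space.

```python
import io

def read_sample_file(handle):
    """Parse a SNPTEST-style sample file: names row, type-code row, then data rows."""
    lines = [line.split() for line in handle if line.strip()]
    columns, types, records = lines[0], lines[1], lines[2:]
    column_types = dict(zip(columns, types))
    rows = []
    for fields in records:
        row = {}
        for name, value in zip(columns, fields):
            if value == "NA":
                row[name] = None          # NA marks a missing value
            elif column_types[name] == "C":
                row[name] = float(value)  # C = continuous covariate
            else:
                row[name] = value         # 0, D and B columns are kept as strings
        rows.append(row)
    return column_types, rows

# Invented miniature example; real files contain more columns and many more rows.
EXAMPLE = """\
chip_id sample_id missing status severe_malaria PC1
0 0 0 D B C
chipA sample1 0 CASE 1 0.0125
chipB sample2 0 CONTROL 0 -0.0310
chipC sample3 0 PARENT NA NA
"""

column_types, rows = read_sample_file(io.StringIO(EXAMPLE))

# Keep only chip assays that have a case/control phenotype.
analysable = [row["chip_id"] for row in rows if row["severe_malaria"] is not None]
print(analysable)
```

For a real file, the same function could be called on an open file handle, for example `open("samples/Kenya_GWAS-2.5M_b37.sample")`.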
Sample metadata file

Each data package is accompanied by a sample metadata file: samples/sample_metadata.csv. This is a tab-separated file listing columns in the above sample file, and giving an abbreviated form of the above descriptions. This file may be useful for automated processing.

Files reflecting family structure

The samples in these datasets are nominally unrelated, with the exception of a small number of familial relationships detailed in the relevant .relationships.txt file. These files describe known blood (i.e. familial) relationships in this study, as reported in our clinical data. (These data contain a small number of trio and parent-child relationships.)

samples/Kenya_GWAS-2.5M_b37.relationships.txt
samples/Gambia_GWAS-2.5M_b37.relationships.txt
samples/Malawi_GWAS-2.5M_b37.relationships.txt*

* This file does not contain any information, as all samples in this data set are unrelated.

Example format:

Family    Child            Father           Mother
family_1  MLCP1_1M1300381  MLCP1_1M1424842  MLCP1_1M1424843
family_2  MLCP1_1M1300381  NA               MLCP1_1M1424843

Genotypes

A directory called ‘vcf’ contains the genotype data in per-chromosome files.

Genotype files:

vcf/Kenya_GWAS-2.5M_b37_chr??.vcf.gz
vcf/Gambia_GWAS-2.5M_b37_chr??.vcf.gz
vcf/Malawi_GWAS-2.5M_b37_chr??.vcf.gz

Index files:

vcf/Kenya_GWAS-2.5M_b37_chr??.vcf.gz.tbi
vcf/Gambia_GWAS-2.5M_b37_chr??.vcf.gz.tbi
vcf/Malawi_GWAS-2.5M_b37_chr??.vcf.gz.tbi

Here ?? represents the zero-padded chromosome number.

Genotype and normalised intensity data are provided in bgzipped VCF format. A tabix index (.tbi) file is provided with each VCF file. See the ‘Useful links’ section below for links to software that can be used to access these data.

Column names in the VCF files refer to the chip_id in the sample information file described above, and appear in the same order as in that file.
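The file naming scheme and the chip_id column ordering described above can be checked programmatically. The sketch below is illustrative only: the helper names `vcf_path` and `vcf_sample_ids` are invented, it assumes that the autosomes are numbered 01 to 22 (this note does not state which chromosomes are included), and the demonstration file is a tiny invented stand-in rather than real data. It relies on the fact that bgzipped files can be read with Python's standard gzip module.

```python
import gzip
import os
import tempfile

def vcf_path(population, chromosome):
    """Build the relative path of one per-chromosome VCF file (zero-padded number)."""
    return f"vcf/{population}_GWAS-2.5M_b37_chr{chromosome:02d}.vcf.gz"

def vcf_sample_ids(path):
    """Return the sample column names (chip IDs) from the #CHROM header line."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine VCF columns are fixed; sample columns follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached data rows without finding a header line
    return []

# Assumption: autosomes 1 to 22 only.
kenya_paths = [vcf_path("Kenya", chrom) for chrom in range(1, 23)]
print(kenya_paths[0])

# Demonstration on a tiny invented file written to a temporary directory.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "chipA", "chipB"])
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_chr01.vcf.gz")
    with gzip.open(demo, "wt") as out:
        out.write("##fileformat=VCFv4.1\n" + header + "\n")
    demo_ids = vcf_sample_ids(demo)
print(demo_ids)
```

Comparing the returned list with the chip_id column of the corresponding sample file is one way to confirm that the two files are in the same order before analysis.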
VCF files contain the following fields:

GT - consensus genotype call, representing a consensus among three algorithms (Illuminus, GenoSNP, and Illumina GenCall). See [1,2] for full methodology.
GLI - genotype call from Illuminus
GLG - genotype call from GenoSNP
GC - genotype call from Illumina's GenCall algorithm
GCS - genotype call score from Illumina's GenCall algorithm
XY - normalised assay intensity information for each SNP and each sample

All chromosomes and positions in the files are in NCBI build 37/GRCh37 coordinates. All data were typed on the Illumina Omni 2.5M [quad/oct] platform using the [HumanOmni2.5-4v1_D/HumanOmni2.5-8v1_A] Illumina chip manifest, which is available from Illumina. Note that variant IDs and alleles in these files reflect the Name, IlmnID and SNP columns of the chip manifest.

Quality control (QC) information

Sample exclusions:

exclusions/Kenya_GWAS-2.5M_b37_sample_exclusions.txt
exclusions/Gambia_GWAS-2.5M_b37_sample_exclusions.txt
exclusions/Malawi_GWAS-2.5M_b37_sample_exclusions.txt

These files list the samples that were excluded from our analysis due to QC criteria including missing call rate and heterozygosity, or as genetic duplicates. Each file has two columns: the first gives the chip ID of the excluded sample, and the second gives the reason for exclusion. Possible reasons for exclusion are 'quality' (excluded due to high missingness, outlying heterozygosity, or outlying average intensities), 'relatedness' (excluded due to high relatedness with another sample), 'technical' (excluded for technical reasons), or 'hapmap' (a HapMap sample).

SNP exclusions:

Kenya_GWAS-2.5M_b37_snp_exclusions.txt
Gambia_GWAS-2.5M_b37_snp_exclusions.txt
Malawi_GWAS-2.5M_b37_snp_exclusions.txt

These files list the SNPs that were excluded from our analysis during QC prior to imputation. Each file has six columns reflecting the SNPID, rsid, chromosome, position and alleles of the excluded SNP.
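The sample exclusion lists described above can be applied to a list of chip IDs as sketched below. This is an illustration under stated assumptions, not the procedure used in the study: the function name `read_sample_exclusions` and the example contents are invented, and the sketch assumes that the exclusion file is whitespace-delimited and has no header row, neither of which is specified in this note.

```python
import io
from collections import Counter

def read_sample_exclusions(handle):
    """Map each excluded chip ID to its reason for exclusion."""
    exclusions = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            # First column: chip ID; second column: reason for exclusion.
            exclusions[fields[0]] = fields[1]
    return exclusions

# Invented example contents; the reasons are those listed in the text above.
EXAMPLE_EXCLUSIONS = """\
chipB quality
chipD relatedness
chipE hapmap
"""

exclusions = read_sample_exclusions(io.StringIO(EXAMPLE_EXCLUSIONS))

# Hypothetical chip IDs, for example taken from the chip_id column of a sample file.
chip_ids = ["chipA", "chipB", "chipC", "chipD", "chipE"]
retained = [chip for chip in chip_ids if chip not in exclusions]
reason_counts = Counter(exclusions.values())

print(retained)
print(dict(reason_counts))
```

Tabulating the reasons, as `reason_counts` does here, gives a quick summary of how many samples were removed at each QC step.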
References

These data were used in the following manuscripts:

[1] Malaria Genomic Epidemiology Network. A novel locus of resistance to severe malaria in a region of ancient balancing selection. Nature, 2015; 526(7572): 253-7. DOI: 10.1038/nature15390
[2] Band et al. Imputation-based meta-analysis of severe malaria in three African populations. PLOS Genetics, 2013; 9(5): e1003509. DOI: 10.1371/journal.pgen.1003509

The following manuscript may also be of use in interpreting these data:

[3] Rockett et al. Reappraisal of known malaria resistance loci in a large multicenter study. Nature Genetics, 2014; 46(11): 1197-204. DOI: 10.1038/ng.3107

Useful links

File formats

VCF format http://www.htslib.org/doc/vcf.html
SNPTEST https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
SNPTEST file formats https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#input_file_formats

More information on rs334

Ensembl genome browser http://www.ensembl.org/Homo_sapiens/Variation/Explore?r=11:5226502-5227502;v=rs334;vdb=variation;vf=328
dbSNP http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs334

The following tools may be useful in manipulating the files contained in this data release:

Vcftools https://vcftools.github.io/index.html
tabix http://www.htslib.org/doc/
QCTOOL http://www.well.ox.ac.uk/~gav/qctool/#overview
VariantAnnotation R package http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
Created:
2016-03-21