Data in support of "An exploration of linkage fine-mapping on sequences from case-control studies"

Source: NIAID Data Ecosystem (indexed 2026-03-13)

Download link:

https://zenodo.org/record/6615457

Description:

These data were simulated for an exploration of linkage fine-mapping on sequences from case-control studies. The scripts to generate and analyze the data are available at https://github.com/SFUStatgen/PBJ0. Queries may be directed to Payman Nickchi at pnickchi@sfu.ca or Charith (Bhagya) Karunarathna at ch757276@dal.ca.

README file for All_data directory

Directory structure

The All_data directory consists of this README file and 500 sub-directories named DatasetX, for X=1 to 500. Within each DatasetX sub-directory are further sub-directories named alt and null containing files named pop_data.RData and sample_data.RData.

alt versus null directories

The files in the alt and null directories contain the same variant data but different phenotype data. In particular, under the null hypothesis, disease status is simulated at random according to a 5% prevalence in the population, whereas under the alternative hypothesis disease status is simulated according to a penetrance model that depends on causal SNVs. The R script to simulate data under the alternative hypothesis is in the file 1_SimulateData.R in the GitHub repository https://github.com/SFUStatgen/PBJ0.

pop_data.RData and sample_data.RData files

The data structures contained in the pop_data.RData and sample_data.RData files are described below. The structure is the same under both the null and alternative hypothesis.

pop_data.RData

From R, load("pop_data.RData") loads a list named pop_data whose elements describe the population’s haplotype and phenotype data. The list elements are as follows.
Variants: a matrix of variants for the population of 6200 haplotypes.
  Rows are SNVs; columns are sequences.
Positions: a data frame of SNV positions.
  Rows are SNVs; column 1 is the SNV name and column 2 is the SNV position in base pairs.
Population.Mapping: a data frame describing how the sequences are paired into individuals.
  Rows are individuals. Column 1 is an individual ID from 1, …, 3100; columns 2 and 3 are the sequence IDs of the first and second sequence for that individual, where the sequence IDs are the column names of the Variants matrix.
Genotype.Matrix: a matrix of genotypes (i.e. variant counts) for the 3100 individuals.
  Rows are SNVs; columns are individuals.
causal_region: a vector containing the lower and upper limits of the causal region in base pairs.
cSNV: a vector containing the IDs of the causal SNVs, where the SNV IDs are the row names of the Variants matrix.
DISCRETE: a list with the following elements.
  CaseIndividuals: a vector of IDs of the affected individuals in the population.
  ControlIndividuals: a vector of IDs of the unaffected individuals in the population.
  BinaryTrait: a vector of trait status (0=unaffected, 1=affected) for each individual.

Note: Within the same DatasetX directory, the only difference between the pop_data data structures under the null and alternative hypothesis is the phenotype information contained in their respective DISCRETE list elements. The null and alternative pop_data data structures share the list elements Variants, Positions, Population.Mapping, Genotype.Matrix, causal_region and cSNV.

sample_data.RData

From R, load("sample_data.RData") loads a list whose elements describe the sequences and phenotypes of the sample of 50 affected individuals (cases) and 50 unaffected individuals (controls) from the population. The list elements are as follows.

Haps: a list with two elements.
  sample_haps: a matrix of 200 sequences for the 50 cases and 50 controls. Rows are SNVs and columns are sequences, with the sequences of sampled cases appearing first (i.e. the first 100 columns), followed by the sequences of sampled controls (i.e. the last 100 columns). Sequences include only those SNVs that are polymorphic in the sample.
  ccStatus: a vector indicating the case/control status of the individual to which each sequence belongs, with case=1 and control=0.
Genos: a list with two elements.
  sample_genos: a matrix of 100 genotypes for the 50 cases and 50 controls. Rows are SNVs and columns are genotypes, with genotypes of cases appearing first, followed by genotypes of controls.
  ccStatus: a vector indicating the case/control status of each individual, with case=1 and control=0.
Posn: a data frame of SNV positions for each SNV that is polymorphic in the sample. The first column is the SNV name and the second is the SNV position in base pairs. Posn is a subset of pop_data$Positions.
poly_cSNV: a vector of IDs for causal SNVs that are polymorphic in the sample.
CaseIND: a vector of individual IDs for the case individuals (see pop_data$Population.Mapping).
ControlIND: a vector of individual IDs for the control individuals (see pop_data$Population.Mapping).
CaseHapID: a vector of IDs for the sequences that belong to cases (see the sequence IDs in the column names of the matrix pop_data$Variants).
ControlHapID: a vector of IDs for the sequences that belong to controls (see the sequence IDs in the column names of the matrix pop_data$Variants).
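
The relationships among these elements can be illustrated on a toy example. The sketch below is not taken from the authors' scripts: it builds a small stand-in list (4 SNVs, 6 haplotypes, 3 individuals) and shows how genotypes as variant counts follow from Variants and Population.Mapping, and how a cases-first haplotype matrix restricted to polymorphic SNVs might be assembled. The 0/1 allele coding, the toy values, the data-frame column names, and the helper genotypes_from_haps are all assumptions made for illustration.

```r
# Toy stand-in for pop_data: 4 SNVs, 6 haplotypes, 3 individuals.
# Values, 0/1 allele coding and column names are illustrative assumptions.
variants <- matrix(
  c(0, 1, 0, 0,   # h1
    1, 1, 0, 0,   # h2
    0, 0, 0, 1,   # h3
    0, 1, 0, 0,   # h4
    0, 0, 1, 0,   # h5
    1, 0, 0, 0),  # h6
  nrow = 4,
  dimnames = list(paste0("SNV", 1:4), paste0("h", 1:6))
)
mapping <- data.frame(
  ID   = 1:3,
  seq1 = c("h1", "h3", "h5"),
  seq2 = c("h2", "h4", "h6"),
  stringsAsFactors = FALSE
)
toy_pop <- list(Variants = variants, Population.Mapping = mapping)

# Hypothetical helper: genotype = sum of an individual's two haplotypes.
# Mapping columns are addressed by position (1 = ID, 2 and 3 = sequence IDs).
genotypes_from_haps <- function(pop) {
  first  <- as.character(pop$Population.Mapping[[2]])
  second <- as.character(pop$Population.Mapping[[3]])
  g <- pop$Variants[, first, drop = FALSE] +
       pop$Variants[, second, drop = FALSE]
  colnames(g) <- pop$Population.Mapping[[1]]
  g
}
toy_genos <- genotypes_from_haps(toy_pop)

# Toy sample: individual 1 as the case, individual 3 as the control.
# Case sequences come first, then control sequences.
case_ids    <- 1
control_ids <- 3
hap_ids <- function(ids) {
  rows <- mapping[match(ids, mapping$ID), ]
  as.vector(t(rows[, 2:3]))  # both sequences of each individual, in order
}
haps <- variants[, c(hap_ids(case_ids), hap_ids(control_ids)), drop = FALSE]

# Keep only SNVs that are polymorphic in the sample (assumes 0/1 coding).
allele_count <- rowSums(haps)
polymorphic  <- allele_count > 0 & allele_count < ncol(haps)
toy_haps <- list(
  sample_haps = haps[polymorphic, , drop = FALSE],
  ccStatus    = rep(c(1, 0), times = c(2 * length(case_ids),
                                       2 * length(control_ids)))
)

print(toy_genos)
print(toy_haps$sample_haps)
```

If the real Variants matrix uses the same 0/1 coding, a call such as all.equal(genotypes_from_haps(pop_data), pop_data$Genotype.Matrix, check.attributes = FALSE) after load("pop_data.RData") would be one way to check this reading of the structure; that coding is an assumption and should be confirmed against the data.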

Created:

2022-06-06
