Simulated Data - Part 3

DataONE. Updated 2014-07-25. Indexed 2024-06-27.

Download link:

https://search.dataone.org/view/null

Resource description:

Ten replicates of a livestock data structure were simulated. The structure was designed to cover a spectrum of QTL distributions, relationship structures, and SNP densities, and to mimic some of the scenarios in which genomic selection is applied.

In each replicate, sequence data for 4,000 base haplotypes for each of thirty chromosomes were simulated using the Markovian Coalescent Simulator (MaCS) (Chen et al., 2009). Each of the thirty chromosomes was 100 cM long, comprising approximately 10^8 base pairs, and was simulated using a per-site mutation rate of 2.5 × 10^-8 and an effective population size (Ne) of 100 in the final generation of the sequence simulation. The reduction of Ne in the preceding generations was modeled with an Ne of 1,256 at 1,000 years ago, 4,350 at 10,000 years ago, and 43,500 at 100,000 years ago, with linear changes in between. This reflects estimates by Villa-Angulo et al. (2009) for the Holstein population.

A pedigree was simulated comprising 10 generations of individuals, with 50 sires per generation, 10 dams per sire, and 2 offspring per dam. Base individuals in the pedigree had their gametes randomly sampled from the 4,000 haplotypes of the sequence simulation, allowing for recombination according to genetic distance, with a 1% probability of a recombination event per cM. Subsequent generations in the pedigree had their gametes generated through Mendelian inheritance with recombination.

The total number of segregating sites across the resulting genome was approximately 1,670,000. A random sample of 60,000 segregating sites was selected from the sequence to be used as SNPs on a 60,000-SNP array. In addition, two sets of 9,000 segregating sites were randomly selected from the sequence to be used as candidate QTL loci: one sampled without restriction, and the other sampled with the restriction that minor allele frequency could not exceed 0.30.

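
The demographic history and pedigree dimensions described above can be sketched in a few lines. This is only an illustration, not the code used to produce the dataset: the function names `ne_at` and `individuals_per_generation` are hypothetical, the anchor at 0 years ago (Ne = 100) assumes the "final generation" corresponds to the present, and holding Ne constant beyond 100,000 years ago is an assumption the description does not state.

```python
# Ne anchor points taken from the description: (years ago, Ne).
# Treating the final generation (Ne = 100) as "0 years ago" is an assumption.
NE_ANCHORS = [
    (0, 100.0),
    (1_000, 1_256.0),
    (10_000, 4_350.0),
    (100_000, 43_500.0),
]


def ne_at(years_ago):
    """Piecewise-linear effective population size at a given time in the past."""
    if years_ago < 0:
        raise ValueError("years_ago must be non-negative")
    # Walk consecutive anchor pairs and interpolate inside the matching interval.
    for (t0, n0), (t1, n1) in zip(NE_ANCHORS, NE_ANCHORS[1:]):
        if years_ago <= t1:
            frac = (years_ago - t0) / (t1 - t0)
            return n0 + frac * (n1 - n0)
    # Beyond the oldest anchor: hold Ne constant (assumption, not in the source).
    return NE_ANCHORS[-1][1]


def individuals_per_generation(sires=50, dams_per_sire=10, offspring_per_dam=2):
    """Offspring produced per generation under the described mating design."""
    return sires * dams_per_sire * offspring_per_dam


if __name__ == "__main__":
    for t in (0, 500, 1_000, 5_500, 100_000):
        print(t, ne_at(t))
    print("individuals per generation:", individuals_per_generation())
```

Under these assumptions the mating design yields 50 × 10 × 2 = 1,000 individuals per generation, which matches the 2,000 individuals the description assigns to the two training generations.
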
Four different traits were simulated assuming an additive genetic model. The first pair of traits was generated using the 9,000 unrestricted candidate QTL loci. For the first trait (PolyUnres), the allele substitution effect at each QTL locus was sampled from a normal distribution with a mean of zero and a standard deviation of one unit. For the second trait (GammaUnres), a random subset of 900 of the candidate QTL loci was selected, and the allele substitution effect at each of these loci was sampled from a gamma distribution with a shape of 0.4 and a scale of 1.66 (Meuwissen et al., 2001), with a 50% chance of being positive or negative. The second pair of traits (PolyRes and GammaRes) was generated in the same way as the first pair, except that the candidate QTL loci were the 9,000 sampled with the restriction that minor allele frequency could not exceed 0.30.

Phenotypes with a heritability of 0.25 were generated for each trait. To ensure that the heritability of the four traits remained constant, the residual variance was scaled relative to the variance of the breeding values of individuals in the base generation, which was given by a'a/(n-1), where a is the vector of breeding values of individuals in the base generation and n is the number of individuals in that generation. Ten replicates of each scenario were simulated.

Training and validation data sets

Subsets of the data were extracted for training and validation. The training set comprised the 2,000 individuals in generations 4 and 5. Three validation sets were extracted. The first (Gen6) comprised 500 individuals sampled at random from generation 6. The second (Gen8) comprised 500 individuals sampled at random from generation 8. The third (Gen10) comprised 500 individuals sampled at random from generation 10.
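
The effect sampling and heritability scaling described above can be illustrated with a minimal sketch using only the Python standard library. All function names here are hypothetical and this is not the original simulation code. The residual variance formula follows from h2 = var_a / (var_a + var_e), and the a'a/(n-1) expression is implemented literally as written, which equals the sample variance only if the breeding values are mean-centred.

```python
import random


def sample_gamma_effects(n_qtl, shape=0.4, scale=1.66, rng=random):
    """Gamma-distributed effect sizes, each made positive or negative with 50% chance."""
    effects = []
    for _ in range(n_qtl):
        magnitude = rng.gammavariate(shape, scale)  # alpha = shape, beta = scale
        sign = 1.0 if rng.random() < 0.5 else -1.0
        effects.append(sign * magnitude)
    return effects


def base_bv_variance(a):
    """a'a / (n - 1) for the base-generation breeding values, as in the description."""
    n = len(a)
    return sum(x * x for x in a) / (n - 1)


def residual_variance(var_a, h2=0.25):
    """Residual variance giving heritability h2, from h2 = var_a / (var_a + var_e)."""
    return var_a * (1.0 - h2) / h2


def simulate_phenotypes(breeding_values, res_var, rng=random):
    """Phenotype = breeding value + normally distributed residual."""
    sd = res_var ** 0.5
    return [bv + rng.gauss(0.0, sd) for bv in breeding_values]


if __name__ == "__main__":
    rng = random.Random(2014)  # fixed seed so the sketch is reproducible
    effects = sample_gamma_effects(900, rng=rng)
    # Toy breeding values stand in for genotype-by-effect sums (illustration only).
    base_bvs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    var_a = base_bv_variance(base_bvs)
    var_e = residual_variance(var_a, h2=0.25)
    phenotypes = simulate_phenotypes(base_bvs, var_e, rng=rng)
    print(len(effects), len(phenotypes), var_e / var_a)
```

With a heritability of 0.25 the residual variance is three times the base-generation breeding value variance, so the same scaling rule keeps heritability fixed across all four traits regardless of how the QTL effects were drawn.
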

Created:

2014-07-25
