Analysis of Sequence Data Under Multivariate Trait-Dependent Sampling

NIAID Data Ecosystem (indexed 2026-03-09)

Download link:

https://figshare.com/articles/dataset/Analysis_of_Sequence_Data_Under_Multivariate_Trait_Dependent_Sampling/1301951


Description:

High-throughput DNA sequencing allows for the genotyping of common and rare variants for genetic association studies. At the present time and for the foreseeable future, it is not economically feasible to sequence all individuals in a large cohort. A cost-effective strategy is to sequence those individuals with extreme values of a quantitative trait. We consider the design under which the sampling depends on multiple quantitative traits. Under such trait-dependent sampling, standard linear regression analysis can result in bias of parameter estimation, inflation of Type I error, and loss of power. We construct a likelihood function that properly reflects the sampling mechanism and uses all available data. We implement a computationally efficient EM algorithm and establish the theoretical properties of the resulting maximum likelihood estimators. Our methods can be used to perform separate inference on each trait or simultaneous inference on multiple traits. We pay special attention to gene-level association tests for rare variants. We demonstrate the superiority of the proposed methods over standard linear regression through extensive simulation studies. We provide applications to the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study and the National Heart, Lung, and Blood Institute Exome Sequencing Project. Supplementary materials for this article are available online.
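The bias described above can be seen in a small simulation. The sketch below is not the authors' likelihood or EM method; it is a deliberately simplified, single-trait illustration of why naive linear regression misbehaves when only individuals with extreme trait values are sequenced. All parameter values (cohort size, minor allele frequency, effect size, tail fraction) and the function names are assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, maf=0.3, beta=0.5, tail=0.1, seed=1):
    """Compare OLS on the full cohort with OLS on an extreme-trait sample.

    Hypothetical setup: one variant, one quantitative trait,
    additive genetic effect `beta`, standard normal noise.
    """
    rng = random.Random(seed)
    # Genotype: number of minor alleles (0, 1, or 2).
    g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    # Trait: linear in genotype plus noise.
    y = [beta * gi + rng.gauss(0.0, 1.0) for gi in g]

    # Trait-dependent sampling: keep only the lowest and highest
    # `tail` fraction of trait values.
    order = sorted(range(n), key=lambda i: y[i])
    k = int(tail * n)
    chosen = order[:k] + order[-k:]

    full = ols_slope(g, y)
    naive = ols_slope([g[i] for i in chosen], [y[i] for i in chosen])
    return full, naive


full_slope, naive_slope = simulate()
print(f"true effect:              0.5")
print(f"OLS on full cohort:       {full_slope:.3f}")
print(f"OLS on extreme sample:    {naive_slope:.3f}")
```

In this toy setting the full-cohort estimate stays close to the true effect, while the estimate from the extreme-trait sample is pushed well away from it. Correcting for this requires a model of the sampling mechanism, which is what the likelihood-based approach in the abstract addresses.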

Created:

2016-01-19
