A set of efficient methods to generate high-dimensional binary data with specified correlation structures
Source: DataCite Commons. Updated: 2024-02-16. Indexed: 2024-07-28.
Download link:
https://tandf.figshare.com/articles/dataset/A_set_of_efficient_methods_to_generate_high-dimensional_binary_data_with_specified_correlation_structures/12896667/1
Description:
High-dimensional correlated binary data arise in many areas, such as observed genetic variations in biomedical research. Data simulation can help researchers evaluate the efficiency and explore the properties of different computational and statistical methods. In addition, some statistical methods, such as Monte Carlo methods, rely on data simulation. Lunn and Davies (1998) proposed linear-time methods to generate correlated binary variables with three common correlation structures. However, their methods do not allow unequal probabilities to be specified. In this manuscript, we introduce several computationally efficient algorithms that generate high-dimensional binary data with specified correlation structures and unequal probabilities. Our algorithms have linear time complexity with respect to the dimension for three commonly studied correlation structures, namely the exchangeable, decaying-product and K-dependent correlation structures. In addition, we extend our algorithms to generate binary data with specified non-negative correlation matrices that satisfy the validity condition, with quadratic time complexity. We provide an R package, CorBin, to implement our simulation methods. Compared with existing packages for binary data generation, the time cost to generate a 100-dimensional binary vector with the common correlation structures and with general correlation matrices can be reduced by up to 10^5-fold and 10^3-fold, respectively, and the efficiency gain grows further as the dimension increases. The R package CorBin is available on CRAN at https://cran.r-project.org/.
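For orientation, the equal-probability baseline that the description attributes to Lunn and Davies (1998) can be sketched for the exchangeable structure. This is a minimal illustration in Python, not the CorBin implementation and not the unequal-probability algorithms introduced in the manuscript; the function name `exchangeable_binary` and its parameters are hypothetical. Each coordinate copies a shared Bernoulli(p) draw with probability sqrt(rho) and otherwise takes its own independent Bernoulli(p) draw, which gives every coordinate mean p and every pair correlation rho in time linear in the dimension.

```python
import math
import random


def exchangeable_binary(n, p, rho, rng):
    """Draw one n-dimensional binary vector with common mean p and
    common pairwise correlation rho (hypothetical helper, equal
    probabilities only, rho restricted to [0, 1])."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")

    share_prob = math.sqrt(rho)           # P(coordinate copies the shared draw)
    shared = 1 if rng.random() < p else 0  # one Bernoulli(p) draw shared by all

    vector = []
    for _ in range(n):                     # one pass over the n coordinates
        if rng.random() < share_prob:
            vector.append(shared)          # copy the shared value
        else:
            vector.append(1 if rng.random() < p else 0)  # independent draw
    return vector


if __name__ == "__main__":
    rng = random.Random(2020)
    print(exchangeable_binary(10, 0.3, 0.25, rng))
```

Two coordinates coincide through the shared draw with probability rho, so their covariance is rho * p * (1 - p) and their correlation is rho. Because every coordinate uses the same p, this sketch cannot express unequal marginal probabilities, which is the limitation the description says the manuscript addresses.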
Provider:
Taylor & Francis
Created:
2020-08-31



