A Set of Efficient Methods to Generate High-Dimensional Binary Data With Specified Correlation Structures
Source: DataCite Commons. Updated: 2024-02-16. Indexed: 2024-07-28.
Download link:
https://tandf.figshare.com/articles/dataset/A_set_of_efficient_methods_to_generate_high-dimensional_binary_data_with_specified_correlation_structures/12896667
Resource description:
High-dimensional correlated binary data arise in many areas, such as observed genetic variations in biomedical research. Data simulation helps researchers evaluate the efficiency and explore the properties of different computational and statistical methods, and some statistical methods, such as Monte Carlo methods, rely on it directly. Lunn and Davies proposed linear-time methods to generate correlated binary variables with three common correlation structures. However, their methods do not allow unequal probabilities to be specified. In this article, we introduce several computationally efficient algorithms that generate high-dimensional binary data with specified correlation structures and unequal probabilities. Our algorithms have linear time complexity with respect to the dimension for three commonly studied correlation structures, namely the exchangeable, decaying-product, and K-dependent correlation structures. In addition, we extend our algorithms to generate binary data from specified nonnegative correlation matrices that satisfy the validity condition, with quadratic time complexity. We provide an R package, CorBin, that implements our simulation methods. Compared with existing packages for binary data generation, the time needed to generate a 100-dimensional binary vector can be reduced by up to 10⁵-fold for the common correlation structures and up to 10³-fold for general correlation matrices, and the advantage grows as the dimension increases. The R package CorBin is available on CRAN at https://cran.r-project.org/.
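To make the exchangeable case concrete, here is a minimal Python sketch of the classical shared-component construction associated with Lunn and Davies, restricted to a single common probability p. It is not the article's algorithm and not the CorBin API: the function names `exchangeable_binary` and `sample_correlation` are hypothetical, and the equal-probability restriction is exactly the limitation the article sets out to remove. Each coordinate copies one shared Bernoulli(p) draw with probability sqrt(rho) and otherwise takes its own independent Bernoulli(p) draw, which gives pairwise correlation rho at a cost linear in the dimension.

```python
import random


def exchangeable_binary(dim, p, rho, rng):
    """Draw one 0/1 vector with P(X_i = 1) = p and corr(X_i, X_j) = rho for i != j.

    Illustrative sketch only: equal probabilities, nonnegative rho.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("p and rho must lie in [0, 1]")
    mix = rho ** 0.5                        # chance a coordinate copies the shared draw
    shared = 1 if rng.random() < p else 0   # one Bernoulli(p) draw common to all coordinates
    vector = []
    for _ in range(dim):                    # one pass over the coordinates: O(dim)
        if rng.random() < mix:
            vector.append(shared)
        else:
            vector.append(1 if rng.random() < p else 0)
    return vector


def sample_correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    rng = random.Random(2020)
    draws = [exchangeable_binary(5, 0.3, 0.25, rng) for _ in range(20000)]
    first = [d[0] for d in draws]
    second = [d[1] for d in draws]
    print("estimated P(X_1 = 1):", sum(first) / len(first))
    print("estimated corr(X_1, X_2):", sample_correlation(first, second))
```

With enough draws, the estimated marginal probability and pairwise correlation should settle near the requested p and rho. Supporting a different probability for each coordinate requires a different construction, which is what the article's algorithms and the CorBin package provide.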
Provider:
Taylor & Francis
Created:
2020-08-31



