
Distributed Learning for Principal Eigenspaces without Moment Constraints

Source: DataCite Commons (updated 2024-05-24, indexed 2024-08-19)
Download link:
https://tandf.figshare.com/articles/dataset/Distributed_Learning_for_Principal_Eigenspaces_without_Moment_Constraints/25594260
Description:
Distributed Principal Component Analysis (PCA) has been studied for settings in which data are stored across multiple machines and communication cost or privacy concerns prohibit computing PCA in a central location. However, the sub-Gaussian assumption in the related literature is restrictive in real applications, since outliers and heavy-tailed data are common in areas such as finance and macroeconomics. In this article, we propose a distributed algorithm for estimating the principal eigenspaces without any moment constraints on the underlying distribution. We study the problem under the elliptical family framework and adopt the sample multivariate Kendall’s tau matrix to extract eigenspace estimators from all submachines, which can be viewed as points in the Grassmann manifold. We then find the “center” of these points as the final distributed estimator of the principal eigenspace. We investigate the bias and variance of the distributed estimator and derive its convergence rate, which depends on the effective rank and eigengap of the scatter matrix and on the number of submachines. We show that the distributed estimator performs as if we had full access to the whole data. Simulation studies show that the distributed algorithm performs comparably with the existing one for light-tailed data, while showing great advantages for heavy-tailed data. We also extend the distributed algorithm to cases with communication constraints and with elliptical factor structure. Thorough simulation studies and a real application to a macroeconomic dataset verify the advantages of the proposed distributed algorithms. Supplementary materials for this article are available online.
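
The pipeline described above (local multivariate Kendall's tau matrices, local eigenspaces, then a "center" of those eigenspaces) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the "center" on the Grassmann manifold is computed, so the sketch assumes one common choice: the top-K eigenvectors of the average of the local projection matrices. The function names, the multivariate t simulation with one degree of freedom (no finite moments), and the scatter matrix used in the demo are all hypothetical choices made for illustration.

```python
import numpy as np

def kendall_tau_matrix(X):
    """Sample multivariate Kendall's tau matrix: the average outer product of
    the normalized pairwise differences (X_i - X_j) / ||X_i - X_j||, i < j."""
    n, _ = X.shape
    i, j = np.triu_indices(n, k=1)
    D = X[i] - X[j]
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    mask = norms[:, 0] > 0          # skip exact ties, which have no direction
    U = D[mask] / norms[mask]
    return U.T @ U / U.shape[0]

def top_eigvecs(M, K):
    """Orthonormal basis of the leading K-dimensional eigenspace of symmetric M."""
    _, V = np.linalg.eigh(M)        # eigh returns eigenvalues in ascending order
    return V[:, -K:]

def distributed_eigenspace(blocks, K):
    """Aggregate local eigenspaces. ASSUMPTION: the 'center' is taken to be the
    top-K eigenspace of the averaged local projection matrices V V^T."""
    p = blocks[0].shape[1]
    avg_proj = np.zeros((p, p))
    for X in blocks:                # each block lives on one (simulated) machine
        V = top_eigvecs(kendall_tau_matrix(X), K)
        avg_proj += V @ V.T         # only a p x K basis would need to be sent
    avg_proj /= len(blocks)
    return top_eigvecs(avg_proj, K)

def projection_distance(V1, V2):
    """Frobenius distance between the projectors onto span(V1) and span(V2)."""
    return np.linalg.norm(V1 @ V1.T - V2 @ V2.T)

def simulate_elliptical_t(n, scatter, df, rng):
    """Hypothetical test data: multivariate t with the given scatter matrix.
    With df = 1 the distribution has no finite moments."""
    p = scatter.shape[0]
    A = np.linalg.cholesky(scatter)
    Z = rng.standard_normal((n, p))
    W = rng.chisquare(df, size=n) / df
    return (Z @ A.T) / np.sqrt(W)[:, None]

# Demo: 10 machines, 200 heavy-tailed observations each, dimension 10, K = 2.
rng = np.random.default_rng(0)
scatter = np.diag([20.0, 15.0] + [1.0] * 8)   # leading eigenspace = span(e1, e2)
blocks = [simulate_elliptical_t(200, scatter, 1.0, rng) for _ in range(10)]
V_hat = distributed_eigenspace(blocks, K=2)
V_true = np.eye(10)[:, :2]
print("projection distance to the true eigenspace:",
      projection_distance(V_hat, V_true))
```

The pairwise-difference construction uses O(n^2) memory per machine, which is fine for a small demonstration but would need batching for large local sample sizes.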

Provider:
Taylor & Francis
Created:
2024-04-12