
Scalable model-free feature screening via sliced-Wasserstein dependency

DataCite Commons | Updated 2024-02-12 | Indexed 2024-08-18
Download link:
https://tandf.figshare.com/articles/dataset/Scalable_model-free_feature_screening_via_sliced-Wasserstein_dependency/22148855/1
Description:
We consider the model-free feature screening problem, which aims to discard non-informative features before downstream analysis. Most existing feature screening approaches have at least quadratic computational cost with respect to the sample size n and thus may suffer from a heavy computational burden when n is large. To alleviate this burden, we propose a scalable model-free sure independence screening approach. The approach is based on the sliced-Wasserstein dependency, a novel metric that measures the dependence between two random variables. Specifically, we quantify the dependence between two random variables by measuring the sliced-Wasserstein distance between their joint distribution and the product of their marginal distributions. For a predictor matrix of size n × d, the computational cost of the proposed algorithm is of order O(n log(n) d), even when the response variable is multivariate. Theoretically, we show that the proposed method enjoys both the sure screening and rank consistency properties under mild regularity conditions. Numerical studies on various synthetic and real-world datasets demonstrate the superior performance of the proposed method in comparison with mainstream competitors, while requiring significantly less computational time.
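
The idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sliced_wasserstein`, `sw_dependency`, `screen`), the number of random projections, the use of the 1-Wasserstein distance, the column standardization, and the use of a random permutation of the feature to mimic a sample from the product of the marginals are all assumptions made for illustration. Each one-dimensional projection only needs a sort, which is where the O(n log n) cost per feature comes from.

```python
import numpy as np


def sliced_wasserstein(a, b, n_proj=50, rng=None):
    """Monte Carlo sliced 1-Wasserstein distance between two samples of equal size.

    a, b: arrays of shape (n, dim). Illustrative helper, not from the paper.
    """
    rng = np.random.default_rng(rng)
    dim = a.shape[1]
    # Random directions on the unit sphere (one direction per column).
    theta = rng.standard_normal((dim, n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)
    # In one dimension, optimal transport between equal-size samples
    # matches sorted values, so sorting each projection is enough.
    pa = np.sort(a @ theta, axis=0)
    pb = np.sort(b @ theta, axis=0)
    return float(np.mean(np.abs(pa - pb)))


def sw_dependency(x, y, n_proj=50, rng=None):
    """Assumed plug-in estimate of the dependence between a feature x and a response y.

    Compares the joint sample (x, y) with a surrogate for the product of the
    marginals, obtained here by randomly permuting x (an assumption).
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(x), -1)  # supports multivariate y
    # Standardize so that scores are comparable across features.
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean(axis=0)) / (y.std(axis=0) + 1e-12)
    joint = np.column_stack([x, y])
    indep = np.column_stack([rng.permutation(x), y])
    return sliced_wasserstein(joint, indep, n_proj, rng)


def screen(X, y, keep, n_proj=50, seed=0):
    """Score every column of X and return the indices of the `keep` largest scores."""
    rng = np.random.default_rng(seed)
    scores = np.array(
        [sw_dependency(X[:, j], y, n_proj, rng) for j in range(X.shape[1])]
    )
    top = np.argsort(scores)[::-1][:keep]
    return top, scores


if __name__ == "__main__":
    # Synthetic check: only column 3 carries signal.
    demo_rng = np.random.default_rng(1)
    X_demo = demo_rng.standard_normal((500, 20))
    y_demo = 2.0 * X_demo[:, 3] + 0.1 * demo_rng.standard_normal(500)
    top_demo, _ = screen(X_demo, y_demo, keep=5)
    print("retained feature indices:", top_demo)
```

With a fixed number of projections, the loop over d features costs O(n log(n) d) in total, matching the order stated in the description. The estimator and tuning choices actually used in the paper may differ from this sketch.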

Provider:
Taylor & Francis
Created:
2023-02-23