
Optimal Sampling for Generalized Linear Models Under Measurement Constraints

Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2024-07-28.
Download link:
https://tandf.figshare.com/articles/dataset/Optimal_Sampling_for_Generalized_Linear_Models_under_Measurement_Constraints/12448871/3
Description:
Under “measurement constraints,” responses are expensive to measure and initially unavailable for most records in the dataset, but the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset on which the expensive responses will be measured, such that the resultant sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend only on a very small set of the responses. A sampling procedure that uses responses, at most, only on a small pilot sample will be called “response-free.” We propose a response-free sampling procedure, optimal sampling under measurement constraints (OSUMC), for generalized linear models. Under the A-optimality criterion, that is, minimizing the trace of the asymptotic variance, the resultant estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent, and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population-level quantities, we propose the OSUMC algorithm to approximate the theoretical optimal sampling. Finally, we conduct an intensive empirical study to demonstrate the advantages of the OSUMC algorithm over existing methods from both statistical and computational perspectives. We find that OSUMC’s performance is comparable to that of sampling algorithms that use complete responses. This shows that, provided an efficient algorithm such as OSUMC is used, there is little or no loss in accuracy due to the unavailability of responses because of measurement constraints.
Supplementary materials for this article are available online.
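
The description above outlines a pilot-then-subsample workflow without giving formulas. The following is a minimal Python sketch of that workflow for logistic regression (one example of a generalized linear model), not the authors' implementation. The function names, the `measure_response` callback, and in particular the scoring rule used for the sampling probabilities (square root of the variance function times the norm of a Hessian-scaled covariate) are assumptions made for illustration; the description does not state the exact A-optimal distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic_fit(X, y, w, n_iter=50, ridge=1e-8):
    """Weighted logistic-regression MLE via Newton's method (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        # Small ridge term keeps the Hessian invertible; an assumption for stability.
        hess = (X * (w * p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def response_free_subsample(X, measure_response, n_pilot, n_sub, rng):
    """Hypothetical response-free subsampling sketch.

    X                : (N, d) covariates, available for every record
    measure_response : callable mapping an index array to measured responses
    """
    N, d = X.shape

    # Step 1: uniform pilot sample. This is the only place where responses
    # influence the sampling probabilities.
    pilot_idx = rng.choice(N, size=n_pilot, replace=False)
    y_pilot = measure_response(pilot_idx)
    beta_pilot = weighted_logistic_fit(X[pilot_idx], y_pilot, np.ones(n_pilot))

    # Step 2: sampling probabilities from covariates and the pilot estimate.
    # ASSUMPTION: this scoring rule is illustrative, not quoted from the source.
    p_all = sigmoid(X @ beta_pilot)
    v = p_all * (1 - p_all)                      # logistic variance function
    Phi = (X * v[:, None]).T @ X / N             # plug-in Hessian estimate
    scores = np.sqrt(v) * np.linalg.norm(np.linalg.solve(Phi, X.T), axis=0)
    probs = scores / scores.sum()

    # Step 3: draw the subsample with replacement and measure its responses.
    sub_idx = rng.choice(N, size=n_sub, replace=True, p=probs)
    y_sub = measure_response(sub_idx)

    # Step 4: inverse-probability-weighted fit on the subsample.
    weights = 1.0 / (N * probs[sub_idx])
    beta_hat = weighted_logistic_fit(X[sub_idx], y_sub, weights)
    return beta_hat, probs
```

In this sketch, responses are measured only for the `n_pilot + n_sub` selected records, which is the sense in which the procedure is "response-free". The inverse-probability weights in the last step correct for the non-uniform sampling.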

Provider:
Taylor & Francis
Created:
2021-09-29