Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning
Source: DataCite Commons. Updated: 2024-06-04. Indexed: 2024-08-18.
Download link:
https://tandf.figshare.com/articles/dataset/Doubly_Robust_Interval_Estimation_for_Optimal_Policy_Evaluation_in_Online_Learning/24525589/1
Resource description:
Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, providing crucial guidance on early stopping of an online experiment and timely feedback from the environment. Policy evaluation in online learning, which infers the mean outcome of the optimal policy (i.e., the value) in real time, is therefore attracting increasing attention. Yet the problem is particularly challenging because of the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration-exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration, which quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.
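To make the idea in the description concrete, the following is a minimal sketch of a doubly robust value estimate with a Wald-type interval, computed from data collected adaptively by a bandit algorithm. It is not the DREAM estimator from the paper. It assumes a context-free multi-armed bandit, a fixed-rate epsilon-greedy rule (so the probability of playing the current greedy arm is known by design rather than derived), Gaussian rewards, and a running sample mean as the outcome model. The function name `dr_value_epsilon_greedy` and all of its parameters are hypothetical.

```python
import numpy as np


def dr_value_epsilon_greedy(means, T=5000, eps=0.1, noise_sd=1.0, seed=0):
    """Simulate an epsilon-greedy bandit and return a doubly robust value
    estimate with a 95% Wald-type interval (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)   # number of pulls per arm so far
    mu_hat = np.zeros(K)   # running mean reward per arm (outcome model)
    scores = np.empty(T)   # one doubly robust score per round

    # Assumption: with fixed eps, the greedy arm is played with this
    # known probability (exploit, or explore and land on it by chance).
    pi_greedy = 1.0 - eps + eps / K

    for t in range(T):
        a_hat = int(np.argmax(mu_hat))  # estimated optimal arm before round t

        # Epsilon-greedy action selection.
        if rng.random() < eps:
            a = int(rng.integers(K))
        else:
            a = a_hat

        r = means[a] + noise_sd * rng.standard_normal()

        # Doubly robust score: model prediction plus an inverse-probability
        # weighted residual, using only estimates from earlier rounds.
        matched = 1.0 if a == a_hat else 0.0
        scores[t] = mu_hat[a_hat] + matched / pi_greedy * (r - mu_hat[a_hat])

        # Update the running mean of the arm that was actually played.
        counts[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / counts[a]

    est = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(T)
    z = 1.96  # approximate 97.5% standard normal quantile
    return est, (est - z * se, est + z * se)


value, (lower, upper) = dr_value_epsilon_greedy([0.2, 0.5, 1.0])
print(f"estimate={value:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Each score uses only the arm means estimated before the current round, which is one simple way to respect the dependence in adaptively collected data. The contextual setting and the exploration probabilities for other bandit algorithms described above would need the paper's own derivations.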
Provider:
Taylor & Francis
Created:
2023-11-08



