Light-weight ensemble Q-network joint implicit constraints for offline reinforcement learning

Name: Light-weight ensemble Q-network joint implicit constraints for offline reinforcement learning
Creator: Chen, Yesen
License: No description available

IEEE, indexed 2026-04-17

Download link:

https://ieee-dataport.org/documents/light-weight-ensemble-q-network-joint-implicit-constraints-offline-reinforcement-learning

Description:

Offline reinforcement learning aims to learn policies from a limited dataset without interacting with the environment. However, the restricted coverage of the dataset limits the agent's understanding of the environment, leading to out-of-distribution (OOD) behavior and extrapolation errors. Conventional research falls into four main approaches: Q-value penalties, policy constraints, uncertainty estimation, and importance sampling. Most existing methods impose overly strict penalties. This paper therefore proposes an algorithm that encourages agents to explore unknown state-action pairs, relying on precise evaluations of OOD actions. First, to address the challenge of assessing Q-values for OOD actions, we discuss the equivalence of uncertainty quantification based on an ensemble of Q-function networks, which avoids the additional computational overhead of simulating OOD sampling. Second, because of the fitting errors inherent in neural networks and their inability to leverage relevant reward information, methods such as behavior cloning struggle to learn better policies. We propose an approach that uses a high-confidence Q-function derived from uncertainty quantification to encourage agents to exploit low-quality datasets, while implicitly constraining policies and enhancing policy improvement. Specifically, we map the behavioral process into Q-space, constraining the learned policy while guiding policy selection towards high-confidence, high-Q-value OOD actions based on the gradient of the prior Q-function. This allows policy constraints to make effective use of reward information and improves algorithm performance by addressing fitting errors. Ultimately, we develop two algorithm variants, SF (SCORE-FAST) and SB (SCORE-BETTER). Theoretical analysis and experimental results demonstrate that SF achieves high performance with rapid convergence, while SB attains state-of-the-art performance. Our code will be published at https://github.com/dksen/SF-SB/tree/main. If interested, please contact the corresponding author.
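
As a rough illustration of ensemble-based uncertainty quantification, the sketch below scores each candidate action by the ensemble mean of its Q-estimates minus a multiple of their standard deviation. This is a common lower-confidence-bound construction and an assumption made here for illustration; it is not taken from the SF or SB code. The names `lower_confidence_q`, `pick_action`, `beta`, and all numeric values are hypothetical.

```python
from statistics import mean, pstdev


def lower_confidence_q(q_values, beta=1.0):
    """Mean of the ensemble's Q-estimates minus beta times their spread.

    Assumption: disagreement between ensemble members (population standard
    deviation) is used as the uncertainty signal for one state-action pair.
    """
    return mean(q_values) - beta * pstdev(q_values)


def pick_action(ensemble_q, beta=1.0):
    """Pick the action with the highest lower-confidence Q-value.

    ensemble_q maps each candidate action to a list of per-network estimates.
    """
    return max(ensemble_q, key=lambda a: lower_confidence_q(ensemble_q[a], beta))


# Hypothetical estimates from a 4-member ensemble for two candidate actions.
ensemble_q = {
    "in_dataset": [1.0, 1.1, 0.9, 1.0],  # members agree: low uncertainty
    "ood": [3.0, -0.5, 2.5, 0.0],        # members disagree: high uncertainty
}

greedy = pick_action(ensemble_q, beta=0.0)    # ignores uncertainty
cautious = pick_action(ensemble_q, beta=1.0)  # penalizes disagreement
print(greedy, cautious)
```

With `beta=0.0` the choice depends only on the ensemble mean, so the OOD action with the higher average estimate is selected; with `beta=1.0` the disagreement penalty outweighs that advantage. Choosing `beta` trades off how readily the agent trusts Q-values for actions the dataset does not cover.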


Provided by:

Chen, Yesen
