Light-weight ensemble Q-network joint implicit constraints for offline reinforcement learning

Name: Light-weight ensemble Q-network joint implicit constraints for offline reinforcement learning
Creator: IEEE DataPort
Published: 2024-09-25 10:56:37
License: No description available

Source: DataCite Commons. Updated: 2024-09-25. Indexed: 2025-04-16.

Download link:

https://ieee-dataport.org/documents/light-weight-ensemble-q-network-joint-implicit-constraints-offline-reinforcement-learning


Description:

Offline reinforcement learning aims to learn policies from a limited dataset without interacting with the environment. However, the restricted nature of the dataset limits the agent's understanding of the environment, leading to out-of-distribution (OOD) behavior and extrapolation errors. Conventional research can be categorized into four main approaches: Q-value penalties, policy constraints, uncertainty estimation, and importance sampling. Most existing methods impose overly strict penalties. Therefore, this paper proposes an algorithm that encourages agents to explore unknown state-action pairs, relying on precise evaluations of OOD actions.

First, to address the challenge of assessing Q-values for OOD actions, we discuss the equivalence of uncertainty quantification based on an ensemble of Q-function networks, which avoids the additional computational overhead associated with simulating OOD sampling. Furthermore, because of the fitting errors inherent in neural networks and the inability to effectively leverage relevant reward information, methods such as behavior cloning struggle to learn better policies. We propose an approach that uses a high-confidence Q-function derived from uncertainty quantification to encourage agents to exploit low-quality datasets, while implicitly constraining policies and enhancing policy improvement. Specifically, we map the behavioral process into Q-space, thereby constraining the learned policy while guiding policy selection towards high-confidence, high-Q-value OOD actions based on the gradient of the prior Q-function. This enables policy constraints to make effective use of reward information and improves algorithm performance by addressing fitting errors.

Ultimately, we develop two algorithm variants, SF (SCORE-FAST) and SB (SCORE-BETTER). Theoretical analysis and experimental results demonstrate that SF achieves high performance with rapid convergence, while SB attains state-of-the-art performance. Our code will be published at https://github.com/dksen/SF-SB/tree/main. If interested, please contact the corresponding author.
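
The idea of ensemble-based uncertainty quantification mentioned above can be illustrated with a generic lower-confidence-bound estimate: take the mean of the ensemble's Q-values for each action and subtract a multiple of their standard deviation. The sketch below is only an illustration of that general technique, not the SF or SB algorithm; the function name `ensemble_lcb`, the weight `beta`, and the toy Q-values are all assumptions introduced here.

```python
from statistics import fmean, pstdev


def ensemble_lcb(q_ensemble, beta=1.0):
    """Return a lower-confidence-bound Q estimate per action.

    q_ensemble: list of rows, one row of per-action Q-values per
    ensemble member (hypothetical layout chosen for this sketch).
    beta: assumed penalty weight on ensemble disagreement.
    """
    # Transpose so that each tuple holds all members' estimates for one action.
    per_action = list(zip(*q_ensemble))
    # Mean minus beta times the population standard deviation.
    return [fmean(qs) - beta * pstdev(qs) for qs in per_action]


# Toy example: three ensemble members, three candidate actions.
# Members agree on actions 0 and 1 but disagree strongly on action 2.
q_ensemble = [
    [1.0, 2.0, 5.0],
    [1.0, 2.0, 1.0],
    [1.0, 2.0, 3.0],
]

mean_q = [fmean(qs) for qs in zip(*q_ensemble)]
lcb_q = ensemble_lcb(q_ensemble, beta=1.0)

greedy_on_mean = max(range(len(mean_q)), key=lambda a: mean_q[a])
greedy_on_lcb = max(range(len(lcb_q)), key=lambda a: lcb_q[a])
```

In this toy case the plain ensemble mean favors action 2, while the uncertainty-penalized estimate favors action 1, because the members disagree about action 2. A larger `beta` makes the estimate more conservative; how the paper balances this against exploring OOD actions is not specified in the abstract.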

Provider:

IEEE DataPort

Created:

2024-09-25
