Light-weight ensemble Q-network joint implicit constraints for offline reinforcement learning

DataCite Commons | Updated: 2024-09-25 | Indexed: 2025-04-16
Download link:
https://ieee-dataport.org/documents/light-weight-ensemble-q-network-joint-implicit-constraints-offline-reinforcement-learning
Description:
Offline reinforcement learning aims to learn policies from a limited dataset without interacting with the environment. However, the restricted coverage of the dataset limits the agent's understanding of the environment, leading to out-of-distribution (OOD) behavior and extrapolation errors. Existing research falls into four main approaches: Q-value penalties, policy constraints, uncertainty estimation, and importance sampling. Most existing methods impose overly strict penalties. This paper therefore proposes an algorithm that encourages agents to explore unknown state-action pairs, relying on precise evaluations of OOD actions. First, to address the challenge of assessing Q-values for OOD actions, we discuss the equivalence of uncertainty quantification based on an ensemble of Q-function networks, which avoids the additional computational overhead of simulating OOD sampling. Furthermore, because of the fitting errors inherent in neural networks and the inability to effectively leverage reward information, methods such as behavior cloning struggle to learn better policies. We propose an approach that uses a high-confidence Q-function derived from uncertainty quantification to encourage agents to make use of low-quality datasets, while implicitly constraining policies and enhancing policy improvement. Specifically, we map the behavioral process into Q-space, constraining the learned policy while guiding policy selection towards high-confidence, high-Q-value OOD actions based on the gradient of the prior Q-function. This lets policy constraints make effective use of reward information and improves algorithm performance by addressing fitting errors. Ultimately, we develop two algorithm variants, SF (SCORE-FAST) and SB (SCORE-BETTER).
Theoretical analysis and experimental results demonstrate that SF achieves high performance with rapid convergence, while SB attains state-of-the-art performance. Our code will be published at https://github.com/dksen/SF-SB/tree/main. If interested, please contact the corresponding author.
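
The description does not give the formulas behind SF or SB, so the following is only a generic sketch of ensemble-based uncertainty quantification, not the authors' method. It assumes that disagreement (standard deviation) across ensemble members serves as the uncertainty signal and that a lower confidence bound, mean minus `beta` times the standard deviation, serves as the "high-confidence" Q-value. The names `lower_confidence_q`, `pick_action`, `beta`, and the example numbers are all hypothetical.

```python
from statistics import mean, pstdev


def lower_confidence_q(q_ensemble, beta=1.0):
    """Summarise an ensemble of Q estimates for one state.

    q_ensemble: list with one entry per ensemble member; each entry is a
    list holding that member's Q estimate for every candidate action.
    beta: hypothetical weight on the uncertainty penalty.
    Returns (means, stds, lcb), each a list with one value per action.
    """
    per_action = list(zip(*q_ensemble))        # transpose to (actions, members)
    means = [mean(qs) for qs in per_action]
    stds = [pstdev(qs) for qs in per_action]   # disagreement across members
    lcb = [m - beta * s for m, s in zip(means, stds)]
    return means, stds, lcb


def pick_action(q_ensemble, beta=1.0):
    """Index of the action with the highest lower-confidence-bound Q-value."""
    _, _, lcb = lower_confidence_q(q_ensemble, beta)
    return max(range(len(lcb)), key=lcb.__getitem__)


# Made-up estimates from three ensemble members for two candidate actions.
# Action 0: members agree on a low value. Action 1: higher mean, more disagreement.
q_ensemble_example = [
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, 6.0],
]

mild = pick_action(q_ensemble_example, beta=1.0)    # uncertain action still wins
strict = pick_action(q_ensemble_example, beta=2.0)  # heavier penalty flips the choice
```

In this toy setup a small `beta` still selects the uncertain but promising action, while a larger `beta` falls back to the action on which the ensemble agrees. That mirrors the trade-off the description raises between overly strict penalties and exploring OOD actions.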
Provider:
IEEE DataPort
Created:
2024-09-25