Reward Prediction Error Prioritisation in Experience Replay: The RPE-PER Method

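Summary

Reinforcement learning algorithms aim to learn optimal control strategies through repeated interaction with an environment.
A key component of this process is the experience replay buffer, which stores past experiences so that the algorithm can learn from a diverse range of interactions rather than only the most recent ones.
This buffer is especially important in dynamic environments where experience is limited.
However, efficiently selecting high-value experiences to accelerate training remains a challenge.
Inspired by the role of reward prediction errors (RPEs) in biological systems, where they are essential for adaptive behaviour and learning, we introduce Reward Prediction Error Prioritised Experience Replay (RPE-PER).
This new approach prioritises experiences in the buffer according to their RPE.
Our method employs EMCN, a critic network that predicts rewards in addition to the Q-values produced by standard critic networks.
The discrepancy between the predicted and actual rewards is computed as the RPE and used as the signal for prioritising experiences.
Experimental evaluations across a range of continuous control tasks show that RPE-PER improves the learning speed and performance of off-policy actor-critic algorithms compared with baseline approaches.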

Summary (original)

Reinforcement Learning algorithms aim to learn optimal control strategies through iterative interactions with an environment. A critical element in this process is the experience replay buffer, which stores past experiences, allowing the algorithm to learn from a diverse range of interactions rather than just the most recent ones. This buffer is especially essential in dynamic environments with limited experiences. However, efficiently selecting high-value experiences to accelerate training remains a challenge. Drawing inspiration from the role of reward prediction errors (RPEs) in biological systems, where they are essential for adaptive behaviour and learning, we introduce Reward Predictive Error Prioritised Experience Replay (RPE-PER). This novel approach prioritises experiences in the buffer based on RPEs. Our method employs a critic network, EMCN, that predicts rewards in addition to the Q-values produced by standard critic networks. The discrepancy between these predicted and actual rewards is computed as RPE and utilised as a signal for experience prioritisation. Experimental evaluations across various continuous control tasks demonstrate RPE-PER’s effectiveness in enhancing the learning speed and performance of off-policy actor-critic algorithms compared to baseline approaches.
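The prioritisation step described in the abstract can be illustrated with a minimal sketch. The abstract only states that the discrepancy between predicted and actual rewards is used as the priority signal; the absolute value, the small floor `eps`, the exponent `alpha`, and proportional sampling are assumptions borrowed from standard prioritised experience replay, not details confirmed by the source. The function names are hypothetical, and the predicted rewards, which would come from the EMCN critic, are replaced here by plain lists.

```python
import random


def rpe_priorities(predicted_rewards, actual_rewards, eps=1e-6):
    """Priority per transition: |predicted reward - actual reward| + eps.

    Assumption: absolute error with a small floor so that no transition
    ends up with zero probability of being sampled.
    """
    return [abs(p - r) + eps for p, r in zip(predicted_rewards, actual_rewards)]


def sampling_probabilities(priorities, alpha=0.6):
    """Proportional prioritisation: P(i) = p_i^alpha / sum_k p_k^alpha.

    Assumption: alpha is a hypothetical hyperparameter; alpha = 0 gives
    uniform sampling, larger values favour high-RPE transitions more.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, rng, alpha=0.6):
    """Draw buffer indices (with replacement) according to the priorities."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Toy buffer of four transitions: rewards predicted by a critic vs. observed.
predicted = [0.9, 0.1, 0.5, 2.0]
actual = [1.0, 0.1, -0.5, 0.0]

priorities = rpe_priorities(predicted, actual)
probs = sampling_probabilities(priorities)
batch = sample_indices(priorities, batch_size=2, rng=random.Random(0))
```

In this toy buffer the fourth transition has the largest reward prediction error and so receives the highest sampling probability, while the second, whose reward was predicted exactly, receives the lowest.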

arXiv information

Authors	Hoda Yamani, Yuning Xing, Lee Violet C. Ong, Bruce A. MacDonald, Henry Williams
Published	2025-01-30 02:09:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
