PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning

Summary

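Offline-to-online reinforcement learning (RL) combines the benefits of offline pretraining and online finetuning, promising better sample efficiency and policy performance.
Existing methods, effective though they are, still suffer from suboptimal performance, limited adaptability, and unsatisfactory computational efficiency.
We propose PROTO, a novel framework that overcomes these limitations by augmenting the standard RL objective with an iteratively evolving regularization term.
PROTO performs a trust-region-style update and gradually evolves the regularization term to relax the constraint strength, which yields stable initial finetuning and optimal final performance.
With only a few lines of code changed, PROTO can bridge any offline policy pretraining method and standard off-policy RL finetuning into a strong offline-to-online RL pathway, making it highly adaptable to diverse methods.
Simple yet elegant, PROTO adds minimal computation and enables highly efficient online finetuning.
Extensive experiments demonstrate that PROTO outperforms SOTA baselines and provides an adaptable, efficient offline-to-online RL framework.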
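
The summary does not state the exact objective, so the following is only a minimal sketch of what an iteratively evolving, trust-region-style regularizer could look like. It assumes a KL penalty toward the previous policy iterate, a linearly decaying penalty weight, and a single state with discrete actions and fixed action values. The function names (linear_decay, kl_regularized_step, finetune) and all numbers are hypothetical and are not taken from the paper.

```python
import math


def linear_decay(alpha0, step, total_steps):
    """Anneal the regularization weight linearly from alpha0 toward 0 (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha0 * (1.0 - frac)


def kl_regularized_step(prev_policy, q_values, alpha):
    """One KL-regularized update for a single state with discrete actions.

    Maximizes  sum_a pi(a) * Q(a) - alpha * KL(pi || prev_policy).
    For alpha > 0 the maximizer is proportional to prev_policy(a) * exp(Q(a) / alpha).
    """
    if alpha <= 0.0:
        # No constraint left: fall back to the greedy policy.
        best = max(range(len(q_values)), key=lambda a: q_values[a])
        return [1.0 if a == best else 0.0 for a in range(len(q_values))]
    logits = [
        (math.log(p) if p > 0.0 else float("-inf")) + q / alpha
        for p, q in zip(prev_policy, q_values)
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def finetune(pretrained_policy, q_values, alpha0=1.0, iterations=10):
    """Regularize each iterate toward the previous one while relaxing alpha."""
    policy = list(pretrained_policy)
    history = [policy]
    for k in range(iterations):
        alpha = linear_decay(alpha0, k, iterations)
        policy = kl_regularized_step(policy, q_values, alpha)
        history.append(policy)
    return history


# Hypothetical toy setup: the pretrained policy prefers action 0,
# while the (fixed, made-up) action values favor action 1.
history = finetune([0.7, 0.2, 0.1], [0.0, 1.0, 0.5])
first, final = history[1], history[-1]
```

In this toy setup the first update stays close to the pretrained policy because the penalty weight is large, while later updates move freely toward the highest-valued action as the weight shrinks. That mirrors the stable-start, unconstrained-finish behavior the summary describes, under the assumptions stated above.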

Summary (original)

Offline-to-online reinforcement learning (RL), by combining the benefits of offline pretraining and online finetuning, promises enhanced sample efficiency and policy performance. However, existing methods, effective as they are, suffer from suboptimal performance, limited adaptability, and unsatisfactory computational efficiency. We propose a novel framework, PROTO, which overcomes the aforementioned limitations by augmenting the standard RL objective with an iteratively evolving regularization term. Performing a trust-region-style update, PROTO yields stable initial finetuning and optimal final performance by gradually evolving the regularization term to relax the constraint strength. By adjusting only a few lines of code, PROTO can bridge any offline policy pretraining and standard off-policy RL finetuning to form a powerful offline-to-online RL pathway, birthing great adaptability to diverse methods. Simple yet elegant, PROTO imposes minimal additional computation and enables highly efficient online finetuning. Extensive experiments demonstrate that PROTO achieves superior performance over SOTA baselines, offering an adaptable and efficient offline-to-online RL framework.

arXiv information

Authors	Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, Ya-Qin Zhang
Published	2023-05-25 02:40:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
