Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only

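Summary

Improving the performance of a pre-trained policy through online reinforcement learning (RL) is an important yet challenging topic.
Existing online RL fine-tuning methods require continued training with an offline pre-trained Q-function for stability and performance.
However, because most offline RL methods are conservative, these offline pre-trained Q-functions generally underestimate state-action pairs outside the offline dataset, which hinders further exploration when moving from the offline to the online setting.
This requirement also limits applicability in scenarios where only a pre-trained policy is available and no pre-trained Q-function exists, such as imitation learning (IL) pre-training.
To address these challenges, we propose an efficient online RL fine-tuning method that uses only the offline pre-trained policy, removing the dependence on a pre-trained Q-function.
To avoid harmful pessimism, we introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase.
Our method not only achieves performance competitive with advanced offline-to-online RL algorithms and with online RL approaches that leverage prior data or policies, but also opens a new path for directly fine-tuning behavior cloning (BC) policies.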
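The abstract does not say how PORL initializes the Q-function, so the following is only a toy illustration of the general idea, not the authors' algorithm. It assumes one plausible reading: keep the pre-trained policy frozen and fit a Q-function from zeros by TD policy evaluation on online rollouts. The chain environment, `pretrained_policy`, `warm_up_q`, and all hyperparameters are hypothetical.

```python
GAMMA = 0.9
N_STATES = 5          # states 0..4; state 4 is terminal (toy assumption)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Toy deterministic chain: reward 1.0 on reaching the last state."""
    if action == RIGHT:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def pretrained_policy(state):
    """Stand-in for a frozen offline pre-trained (e.g. BC) policy."""
    return RIGHT


def warm_up_q(policy, episodes=200, lr=0.5):
    """Fit Q from zeros by TD policy evaluation while the policy stays frozen."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = policy(state)
            next_state, reward, done = step(state, action)
            # Bootstrap from the action the frozen policy would take next.
            bootstrap = 0.0 if done else q[next_state][policy(next_state)]
            q[state][action] += lr * (reward + GAMMA * bootstrap - q[state][action])
            state = next_state
    return q


q_table = warm_up_q(pretrained_policy)
```

In this sketch the Q-values of the actions the policy actually takes approach its true discounted returns (for example, `GAMMA ** 3` at the start state), without any pessimistic offline critic. Actions the policy never takes keep their initial value of zero, so a real fine-tuning method would still need exploration before improving the policy.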

Abstract (original)

Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pretrained Q-functions for stability and performance. However, these offline pretrained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism in most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves competitive performance with advanced offline-to-online RL algorithms and online RL approaches that leverage data or policies prior, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.

arXiv information

Authors	Wei Xiao, Jiacheng Liu, Zifeng Zhuang, Runze Suo, Shangke Lyu, Donglin Wang
Published	2025-05-22 16:14:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
