Adapting Image-based RL Policies via Predicted Rewards

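Summary

Image-based reinforcement learning (RL) generalizes poorly when the visual environment changes substantially between training and deployment.
In such settings, a learned policy may perform badly and produce degraded results.
Previous approaches to this problem have mainly focused on broadening the distribution of training observations, using techniques such as data augmentation and domain randomization.
However, because RL decision-making is sequential, residual errors are often propagated by the learned policy model and accumulate along the trajectory, severely degrading performance.
This paper builds on the observation that rewards predicted under domain shift, although imperfect, can still serve as a useful signal to guide fine-tuning.
The authors exploit this property to fine-tune the policy using reward predictions in the target domain.
They find that, even under significant domain shift, the predicted reward still provides a meaningful signal, and that fine-tuning substantially improves the original policy.
The approach, called Predicted Reward Fine-tuning (PRFT), improves performance across diverse tasks in both simulated benchmarks and real-world experiments.
More information is available on the project web page: https://sites.google.com/view/prft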

Summary (original)

Image-based reinforcement learning (RL) faces significant challenges in generalization when the visual environment undergoes substantial changes between training and deployment. Under such circumstances, learned policies may not perform well leading to degraded results. Previous approaches to this problem have largely focused on broadening the training observation distribution, employing techniques like data augmentation and domain randomization. However, given the sequential nature of the RL decision-making problem, it is often the case that residual errors are propagated by the learned policy model and accumulate throughout the trajectory, resulting in highly degraded performance. In this paper, we leverage the observation that predicted rewards under domain shift, even though imperfect, can still be a useful signal to guide fine-tuning. We exploit this property to fine-tune a policy using reward prediction in the target domain. We have found that, even under significant domain shift, the predicted reward can still provide meaningful signal and fine-tuning substantially improves the original policy. Our approach, termed Predicted Reward Fine-tuning (PRFT), improves performance across diverse tasks in both simulated benchmarks and real-world experiments. More information is available at project web page: https://sites.google.com/view/prft.
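
The abstract's core idea, fine-tuning a policy with predicted rather than true rewards in the target domain, can be illustrated with a toy sketch. This is not the authors' PRFT implementation: the three-action bandit, the `predicted_reward` function, its bias and noise levels, and the REINFORCE update are all assumptions chosen for illustration. The sketch only shows that an imperfect but correlated reward prediction can still move a poorly initialized policy toward the better action.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_reward(action, rng):
    """Hypothetical frozen reward predictor evaluated in the target domain.

    Assumption: the prediction is biased and noisy, yet still correlated
    with the true reward of each action.
    """
    true_reward = [0.2, 1.0, 0.5][action]
    return 0.6 * true_reward + 0.1 + rng.gauss(0.0, 0.2)


def fine_tune(logits, steps=5000, lr=0.1, seed=0):
    """Fine-tune a softmax policy with REINFORCE using predicted rewards only."""
    rng = random.Random(seed)
    logits = list(logits)
    baseline = 0.0  # running average of predicted reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        r_hat = predicted_reward(action, rng)
        advantage = r_hat - baseline
        baseline += 0.05 * (r_hat - baseline)
        # Gradient of log pi(action) with respect to each logit.
        for a in range(len(logits)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return logits


# The initial policy prefers action 0, which is the worst action here.
initial_logits = [2.0, 0.0, 0.0]
initial_probs = softmax(initial_logits)
tuned_probs = softmax(fine_tune(initial_logits))
print("before:", [round(p, 3) for p in initial_probs])
print("after: ", [round(p, 3) for p in tuned_probs])
```

In this sketch the true reward is never observed during fine-tuning; only the predictor's output drives the update, which mirrors the setting the abstract describes at a very small scale.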

arXiv information

Authors	Weiyao Wang, Xinyuan Fang, Gregory D. Hager
Published	2024-07-23 21:08:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
