Overcoming Reward Model Noise in Instruction-Guided Reinforcement Learning

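Summary

Vision-language models (VLMs) are attracting attention as auxiliary reward models that provide more informative reward signals in sparse-reward environments.
However, our work reveals a critical vulnerability in this approach: even a small amount of noise in the reward signal can substantially degrade agent performance.
In challenging sparse-reward environments, we show that reinforcement learning agents that use VLM-based reward models without proper noise handling perform worse than agents that rely on exploration-driven methods alone.
We hypothesize that false positive rewards (cases where the reward model mistakenly assigns reward to a trajectory that does not satisfy the given instruction) harm learning more than false negative rewards.
Our analysis supports this hypothesis and shows that the widely used cosine similarity metric, when applied to comparing agent trajectories with language instructions, tends to produce false positive reward signals.
To address this, we introduce BiMI (Binary Mutual Information), a new noise-resilient reward function.
In our experiments, BiMI substantially improves agent performance, with an average improvement ratio of 44.5% across diverse environments using learned, non-oracle VLMs, making VLM-based reward models practical for real-world applications.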

Summary (original)

Vision-language models (VLMs) have gained traction as auxiliary reward models to provide more informative reward signals in sparse reward environments. However, our work reveals a critical vulnerability of this method: a small amount of noise in the reward signal can severely degrade agent performance. In challenging environments with sparse rewards, we show that reinforcement learning agents using VLM-based reward models without proper noise handling perform worse than agents relying solely on exploration-driven methods. We hypothesize that false positive rewards, where the reward model incorrectly assigns rewards to trajectories that do not fulfill the given instruction, are more detrimental to learning than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric, when applied to comparing agent trajectories and language instructions, is prone to generating false positive reward signals. To address this, we introduce BiMI (Binary Mutual Information), a novel noise-resilient reward function. Our experiments demonstrate that BiMI significantly boosts agent performance, with an average improvement ratio of 44.5% across diverse environments with learned, non-oracle VLMs, thereby making VLM-based reward models practical for real-world applications.
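
The false positive problem with cosine similarity can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the 4-dimensional embeddings, the function names, and the 0.9 threshold are hypothetical and do not come from the paper. The thresholded variant is a generic way to suppress partial-match rewards; it is not the BiMI definition, which the abstract does not give.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_reward(traj_emb, instr_emb):
    """Auxiliary reward equal to the raw cosine similarity score."""
    return cosine_similarity(traj_emb, instr_emb)


def binary_reward(traj_emb, instr_emb, threshold=0.9):
    """Illustrative alternative: reward 1.0 only above a (hypothetical) threshold."""
    return 1.0 if cosine_similarity(traj_emb, instr_emb) >= threshold else 0.0


# Hypothetical embeddings; each dimension stands for one sub-goal of the instruction.
instruction = [1.0, 1.0, 1.0, 1.0]  # e.g. a four-step instruction
fulfilled = [1.0, 1.0, 1.0, 1.0]    # trajectory that completes every step
partial = [1.0, 1.0, 1.0, 0.0]      # trajectory that skips the final step

r_partial = similarity_reward(partial, instruction)
r_fulfilled = similarity_reward(fulfilled, instruction)

print(f"raw reward, partial trajectory:   {r_partial:.3f}")
print(f"raw reward, fulfilled trajectory: {r_fulfilled:.3f}")
print(f"binary reward, partial:   {binary_reward(partial, instruction)}")
print(f"binary reward, fulfilled: {binary_reward(fulfilled, instruction)}")
```

In this toy setup the partial trajectory still receives a raw reward of about 0.87 even though it does not fulfill the instruction, which is the kind of false positive signal the abstract describes. The thresholded variant gives it zero reward, at the cost of having to choose the threshold.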

arXiv information

Authors	Sukai Huang, Nir Lipovetzky, Trevor Cohn
Published	2024-09-24 09:45:20+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
