Reducing Reward Dependence in RL Through Adaptive Confidence Discounting

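Summary

In human-in-the-loop reinforcement learning, or in environments where computing a reward is expensive, costly rewards can make it hard to learn efficiently.
Because obtaining human feedback or computing an expensive reward has a cost, algorithms that receive feedback at every step of a long training session may be infeasible, which can limit an agent's ability to improve its performance efficiently.
Our aim is to reduce a learning agent's reliance on humans or expensive rewards, improving learning efficiency while maintaining the quality of the learned policy.
We present a novel reinforcement learning algorithm that requests a reward only when its knowledge of the value of actions in an environment state is low.
Our approach uses a reward function model as a proxy for human-delivered or expensive rewards when confidence is high, and asks for explicit rewards only when confidence in the model's predicted reward and/or in the action selection is low.
By reducing dependence on costly rewards, the agent can learn efficiently in settings where the logistics or expense of obtaining rewards might otherwise be prohibitive.
In our experiments, the approach achieves performance comparable to a baseline in terms of return and number of episodes required to learn, while using as few as 20% of the rewards.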

Summary (original)

In human-in-the-loop reinforcement learning or environments where calculating a reward is expensive, the costly rewards can make learning efficiency challenging to achieve. The cost of obtaining feedback from humans or calculating expensive rewards means algorithms receiving feedback at every step of long training sessions may be infeasible, which may limit agents’ abilities to efficiently improve performance. Our aim is to reduce the reliance of learning agents on humans or expensive rewards, improving the efficiency of learning while maintaining the quality of the learned policy. We offer a novel reinforcement learning algorithm that requests a reward only when its knowledge of the value of actions in an environment state is low. Our approach uses a reward function model as a proxy for human-delivered or expensive rewards when confidence is high, and asks for those explicit rewards only when there is low confidence in the model’s predicted rewards and/or action selection. By reducing dependence on the expensive-to-obtain rewards, we are able to learn efficiently in settings where the logistics or expense of obtaining rewards may otherwise prohibit it. In our experiments our approach obtains comparable performance to a baseline in terms of return and number of episodes required to learn, but achieves that performance with as few as 20% of the rewards.
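
The idea of requesting a reward only when confidence is low can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say how confidence is measured, so the sketch assumes a simple stand-in (the number of true rewards already observed for a state-action pair). The class `ConfidenceGatedRewardModel`, the `min_samples` threshold, the toy chain environment, and all hyperparameters are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict


class ConfidenceGatedRewardModel:
    """Tabular running-mean reward model that asks for a true reward only when unsure.

    Assumption: confidence is approximated by how many true rewards have been
    observed for a (state, action) pair. The paper's actual measure may differ.
    """

    def __init__(self, min_samples=3):
        self.min_samples = min_samples   # hypothetical confidence threshold
        self.mean = defaultdict(float)   # predicted reward per (state, action)
        self.count = defaultdict(int)    # true rewards seen per (state, action)
        self.queries = 0                 # number of costly rewards requested

    def reward(self, state, action, query_true_reward):
        key = (state, action)
        if self.count[key] < self.min_samples:
            # Low confidence: pay for the explicit reward and update the model.
            r = query_true_reward(state, action)
            self.queries += 1
            self.count[key] += 1
            self.mean[key] += (r - self.mean[key]) / self.count[key]
            return r
        # High confidence: use the model's prediction as a proxy.
        return self.mean[key]


def run(episodes=200, n_states=6, seed=0):
    """Q-learning on a toy chain; returns (costly queries, total steps, Q-table)."""
    rng = random.Random(seed)
    goal = n_states - 1
    actions = (-1, 1)  # move left or right along the chain

    def true_reward(state, action):
        # Stand-in for a human or an expensive computation.
        nxt = min(max(state + action, 0), goal)
        return 1.0 if nxt == goal else 0.0

    model = ConfidenceGatedRewardModel()
    q = defaultdict(float)
    alpha, gamma, epsilon = 0.5, 0.9, 0.2
    steps = 0
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            r = model.reward(state, action, true_reward)
            nxt = min(max(state + action, 0), goal)
            best_next = 0.0 if nxt == goal else max(q[(nxt, a)] for a in actions)
            q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
            steps += 1
            state = nxt
            if state == goal:
                break
    return model.queries, steps, q
```

Calling `run()` returns the number of costly reward queries alongside the total number of environment steps, so the two can be compared directly. In this toy setting the query count is bounded by `min_samples` times the number of state-action pairs, while the step count keeps growing with training. That gap is the kind of saving the abstract describes, although the figures from this sketch say nothing about the paper's reported results.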

arXiv information

Authors	Muhammed Yusuf Satici, David L. Roberts
Published	2025-02-28 15:58:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
