Intentionally-underestimated Value Function at Terminal State for Temporal-difference Learning with Mis-designed Reward

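Summary

Robot control using reinforcement learning has become widespread, but for safety and time-saving reasons the learning process usually ends partway through an episode.
This study addresses a problem with the most common exception handling that temporal-difference (TD) learning performs at such termination.
Specifically, forcibly assuming a value of zero after termination causes unintended, implicit underestimation or overestimation, depending on how the reward is designed for normal states.
When an episode ends because the task failed, that failure can be unintentionally overestimated, and the wrong policy may be learned.
This problem can be avoided by careful reward design, but in practical use of TD learning it is essential to review the exception handling at termination.
This paper therefore proposes a method that intentionally underestimates the value after termination to avoid learning failures caused by unintended overestimation.
In addition, the degree of underestimation is adjusted according to the degree of stationarity at termination, which prevents the excessive exploration that intentional underestimation would otherwise cause.
Simulations and real-robot experiments showed that the proposed method stably obtains optimal policies across a variety of tasks and reward designs.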

Summary (original)

Robot control using reinforcement learning has become popular, but its learning process generally terminates halfway through an episode for safety and time-saving reasons. This study addresses the problem of the most popular exception handling that temporal-difference (TD) learning performs at such termination. That is, by forcibly assuming zero value after termination, unintentionally implicit underestimation or overestimation occurs, depending on the reward design in the normal states. When the episode is terminated due to task failure, the failure may be highly valued with the unintentional overestimation, and the wrong policy may be acquired. Although this problem can be avoided by paying attention to the reward design, it is essential in practical use of TD learning to review the exception handling at termination. This paper therefore proposes a method to intentionally underestimate the value after termination to avoid learning failures due to the unintentional overestimation. In addition, the degree of underestimation is adjusted according to the degree of stationarity at termination, thereby preventing excessive exploration due to the intentional underestimation. Simulations and real robot experiments showed that the proposed method can stably obtain the optimal policies for various tasks and reward designs. https://youtu.be/AxXr8uFOe7M
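
The overestimation described above can be seen with a few lines of arithmetic. The following is a minimal sketch, not the paper's implementation: the reward of -1 per step, the discount factor, the episode lengths, and the helper `discounted_return` are all assumptions chosen for illustration. The constant pessimistic bound `r_min / (1 - gamma)` is likewise a simple stand-in for "intentional underestimation"; the paper's stationarity-based adjustment of the degree of underestimation is not reproduced here.

```python
def discounted_return(rewards, gamma, terminal_value=0.0):
    """Discounted sum of rewards, bootstrapping from terminal_value at the end.

    terminal_value plays the role of the value assumed after termination;
    0.0 corresponds to the common "zero value after termination" handling.
    """
    g = terminal_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9          # assumed discount factor
step_reward = -1.0   # assumed cost-style reward paid at every normal step

# Episode A: the task fails after 2 steps and the episode is cut off,
# with the value after termination forced to zero.
fail_zero = discounted_return([step_reward] * 2, gamma)

# Episode B: the robot keeps operating normally for 20 steps.
survive = discounted_return([step_reward] * 20, gamma)

# Hypothetical pessimistic handling: assume the worst possible return
# r_min / (1 - gamma) after a failure instead of zero.
pessimistic_value = step_reward / (1.0 - gamma)
fail_pessimistic = discounted_return(
    [step_reward] * 2, gamma, terminal_value=pessimistic_value
)

print(f"failure, zero terminal value:        {fail_zero:.3f}")
print(f"normal operation for 20 steps:       {survive:.3f}")
print(f"failure, pessimistic terminal value: {fail_pessimistic:.3f}")
```

With an all-negative reward, the zero terminal value makes the early failure look better than continued normal operation, which is the unintended overestimation. Replacing zero with a pessimistic value reverses the ordering in this toy setting.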

arXiv information

Author	Taisuke Kobayashi
Publication date	2023-08-24 13:21:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
