Reward Learning with Intractable Normalizing Functions

Abstract

Robots can learn to imitate humans by inferring what the human is optimizing for. One common framework for this is Bayesian reward learning, where the robot treats the human’s demonstrations and corrections as observations of their underlying reward function. Unfortunately, this inference is doubly-intractable: the robot must reason over all the trajectories the person could have provided and all the rewards the person could have in mind. Prior work uses existing robotic tools to approximate this normalizer. In this paper, we group previous approaches into three fundamental classes and analyze the theoretical pros and cons of their approach. We then leverage recent research from the statistics community to introduce Double MH reward learning, a Monte Carlo method for asymptotically learning the human’s reward in continuous spaces. We extend Double MH to conditionally independent settings (where each human correction is viewed as completely separate) and conditionally dependent environments (where the human’s current correction may build on previous inputs). Across simulations and user studies, our proposed approach infers the human’s reward parameters more accurately than the alternate approximations when learning from either demonstrations or corrections. See videos here: https://youtu.be/EkmT3o5K5ko
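
The "doubly-intractable" inference described above can be illustrated with a toy double Metropolis-Hastings sampler. The sketch below is not the authors' implementation. It assumes a Boltzmann-style human model with unnormalized likelihood exp(BETA * reward), a scalar "trajectory" xi in [0, 1], a scalar reward parameter theta with a uniform prior on [0, 1], and a quadratic reward; the names `log_f`, `inner_mh`, `double_mh` and all constants are hypothetical. An outer chain proposes a new theta, and an inner chain draws an auxiliary trajectory under that proposal, so the normalizer over trajectories never has to be computed.

```python
import math
import random

BETA = 20.0  # assumed rationality coefficient (hypothetical value)


def log_f(xi, theta):
    """Unnormalized log-likelihood of a scalar trajectory xi in [0, 1]."""
    if not 0.0 <= xi <= 1.0:
        return -math.inf
    return -BETA * (xi - theta) ** 2  # assumed reward: -(xi - theta)^2


def log_prior(theta):
    """Uniform prior over theta on [0, 1] (an assumption of this sketch)."""
    return 0.0 if 0.0 <= theta <= 1.0 else -math.inf


def inner_mh(xi_start, theta, steps, rng, step=0.2):
    """Inner chain: approximately sample a trajectory from f(. | theta)."""
    xi = xi_start
    for _ in range(steps):
        cand = xi + rng.gauss(0.0, step)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_f(cand, theta) - log_f(xi, theta):
            xi = cand
    return xi


def double_mh(demos, n_samples, rng, inner_steps=20, step=0.1):
    """Outer chain over theta; the normalizer is never evaluated."""
    theta = 0.5
    samples = []
    for _ in range(n_samples):
        cand = theta + rng.gauss(0.0, step)  # symmetric proposal
        if log_prior(cand) > -math.inf:
            log_ratio = log_prior(cand) - log_prior(theta)
            for x in demos:
                # Auxiliary trajectory drawn under the proposed theta,
                # with the inner chain started at the observed one.
                y = inner_mh(x, cand, inner_steps, rng)
                log_ratio += log_f(x, cand) - log_f(x, theta)
                log_ratio += log_f(y, theta) - log_f(y, cand)
            if math.log(1.0 - rng.random()) < log_ratio:
                theta = cand
        samples.append(theta)
    return samples


# Usage: five made-up demonstrations clustered around 0.7.
demos = [0.68, 0.72, 0.70, 0.71, 0.69]
samples = double_mh(demos, 2000, random.Random(0))
kept = samples[500:]  # discard burn-in
print(sum(kept) / len(kept))
```

Because the auxiliary trajectory comes from a finite inner chain rather than an exact sampler, the resulting posterior samples are approximate; longer inner chains trade computation for accuracy.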

arXiv information

Authors	Joshua Hoegerman, Dylan P. Losey
Published	2023-05-16 16:59:12+00:00
