Trajectory Improvement and Reward Learning from Comparative Language Feedback

Abstract

Learning from human feedback has gained traction in fields like robotics and natural language processing in recent years. While prior works mostly rely on human feedback in the form of comparisons, language is a preferable modality that provides more informative insights into user preferences. In this work, we aim to incorporate comparative language feedback to iteratively improve robot trajectories and to learn reward functions that encode human preferences. To achieve this goal, we learn a shared latent space that integrates trajectory data and language feedback, and subsequently leverage the learned latent space to improve trajectories and learn human preferences. To the best of our knowledge, we are the first to incorporate comparative language feedback into reward learning. Our simulation experiments demonstrate the effectiveness of the learned latent space and the success of our learning algorithms. We also conduct human subject studies that show our reward learning algorithm achieves a 23.9% higher subjective score on average and is 11.3% more time-efficient compared to preference-based reward learning, underscoring the superior performance of our method. Our website is at https://liralab.usc.edu/comparative-language-feedback/
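The abstract does not specify the model, so the following is only a hypothetical Python sketch of the general idea. It rests on assumptions that are not taken from the paper. Trajectories and feedback phrases are assumed to be already encoded into the same d-dimensional latent space; synthetic vectors stand in for the encoders here. The reward is assumed to be linear in the latent, r(z) = w·z. A feedback vector u is read as "z + u is preferred to z" and is scored with a Bradley-Terry (logistic) likelihood. Trajectory improvement is approximated by retrieving the candidate trajectory closest to z + u. The names `learn_reward` and `improve`, and all of the numbers, are illustrative and are not the authors' code.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def learn_reward(feedback_dirs, lr=0.5, steps=300):
    """Fit w for a hypothetical linear reward r(z) = w @ z.

    Assumption: each feedback vector u means "z + u is preferred to z".
    Under a Bradley-Terry model, P(z + u > z) = sigmoid(r(z + u) - r(z))
    = sigmoid(w @ u), so the current trajectory latent z cancels out.
    """
    U = np.asarray(feedback_dirs, dtype=float)
    w = np.zeros(U.shape[1])
    for _ in range(steps):
        p = sigmoid(U @ w)
        # Gradient of sum_i log sigmoid(w @ u_i) is sum_i (1 - p_i) * u_i.
        w += lr * ((1.0 - p)[:, None] * U).sum(axis=0)
        # Project back onto the unit ball so that separable feedback
        # cannot drive ||w|| to infinity.
        w /= max(np.linalg.norm(w), 1.0)
    return w


def improve(z, u, candidates):
    """Hypothetical improvement step: shift z along u in the latent space
    and return the candidate trajectory latent closest to z + u."""
    target = z + u
    idx = np.argmin(np.linalg.norm(candidates - target, axis=1))
    return candidates[idx]


# --- Synthetic demo: every quantity below is made up for illustration. ---
rng = np.random.default_rng(0)
d = 4                                      # hypothetical latent dimension
w_true = np.array([1.0, -0.5, 0.0, 0.3])   # simulated user's preference
w_true /= np.linalg.norm(w_true)
candidates = rng.normal(size=(200, d))     # stand-in for encoded trajectories

z = candidates[0]
feedback = []
for _ in range(10):
    # Simulated comparative feedback: a noisy latent direction that points
    # toward higher true reward (a stand-in for an embedded phrase).
    u = w_true + 0.3 * rng.normal(size=d)
    feedback.append(u)
    z = improve(z, u, candidates)

w_hat = learn_reward(feedback)
cos_sim = w_hat @ w_true / np.linalg.norm(w_hat)
print(f"cosine(w_hat, w_true) = {cos_sim:.3f}")
print(f"true reward: start {candidates[0] @ w_true:.2f}, end {z @ w_true:.2f}")
```

Because the feedback is modeled as a direction instead of a pair of full trajectories, each piece of feedback contributes a comparison without requiring the user to inspect two rollouts. That matches the abstract's motivation for language over plain comparisons, but the specific likelihood used here is an assumption.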

arXiv information

Authors	Zhaojing Yang, Miru Jun, Jeremy Tien, Stuart J. Russell, Anca Dragan, Erdem Bıyık
Published	2024-10-08 22:15:46+00:00
arXiv site	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
