PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning

Summary

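Preference-based reinforcement learning (RL) has emerged as a new field in robot learning. Humans play a key role in shaping robot behavior by expressing preferences between different sequences of state-action pairs.
However, learning realistic robot policies requires human responses to a large number of queries.
This work addresses that sample-efficiency problem by expanding the information collected per query to include both a preference and an optional text prompt.
To do so, it uses the zero-shot capabilities of a large language model (LLM) to reason over the text that humans provide.
To accommodate the additional query information, the reward learning objective is reformulated to include flexible highlights: state-action pairs that carry relatively high information and relate to features extracted in a zero-shot fashion by a pretrained LLM.
The effectiveness of the approach is shown in both a simulated scenario and a user study by analyzing the feedback and its implications.
In addition, the collected feedback is used to train a robot on socially compliant trajectories in a simulated social navigation environment.
See https://sites.google.com/view/rl-predilect for video examples of the trained policies.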

Abstract (original)

Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences on different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights — state-action pairs that contain relatively high information and are related to the features processed in a zero-shot fashion from a pretrained LLM. In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collective feedback collected serves to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect
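
For readers unfamiliar with how a preference between two segments becomes a training signal, the sketch below shows the Bradley-Terry style loss that is commonly used in preference-based RL. It is an illustration only, not the objective from this paper: the abstract does not specify the reformulated objective or how highlights enter it, so that part is left out. The function names (`segment_return`, `preference_probability`, `preference_loss`) and the toy reward lists are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def segment_return(rewards):
    """Sum of predicted per-step rewards over one segment of state-action pairs."""
    return sum(rewards)


def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Assumption: the preference depends only on the difference of summed rewards.
    """
    return sigmoid(segment_return(rewards_a) - segment_return(rewards_b))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss for one query.

    label is 1.0 if the human preferred segment A and 0.0 if they preferred B.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    # Hypothetical predicted rewards for two short segments.
    seg_a = [0.5, 0.7, 0.9]
    seg_b = [0.1, 0.2, 0.3]
    print(preference_probability(seg_a, seg_b))
    print(preference_loss(seg_a, seg_b, label=1.0))
```

In a real system the per-step rewards would come from a learned reward model and the loss would be minimized over many queries. Any term that uses the text-derived highlights would be added on top of this, in a form this page does not describe.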

arXiv information

Authors	Simon Holk, Daniel Marta, Iolanda Leite
Published	2024-02-23 16:30:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
