LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning

Abstract

We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL). These labels are used to train an online preference predictor, which in turn guides the policy optimization process toward satisfying high-level behavioral specifications provided by humans. Our key technical contribution is the integration of LLMs into the RL feedback loop through trajectory-level preference prediction, enabling robots to acquire complex skills including subtle control over gait patterns and rhythmic timing. We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning.
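The abstract describes a loop in which pairwise preference labels over trajectories train a predictor whose score then guides policy optimization. The sketch below illustrates only that general idea and is not the authors' implementation. The abstract does not state the predictor's architecture or loss, so the linear model and the Bradley-Terry (logistic) loss are assumptions, and `features`, `stand_in_labeler`, `train_predictor`, and `predicted_reward` are hypothetical names. In particular, `stand_in_labeler` is a placeholder for the LLM query and simply prefers the trajectory with the larger mean in dimension 0.

```python
import math
import random

def features(traj):
    """Hypothetical featurizer: per-dimension mean over time steps."""
    n = len(traj)
    return [sum(step[i] for step in traj) / n for i in range(len(traj[0]))]

def stand_in_labeler(traj_a, traj_b):
    """Placeholder for the LLM query (assumption): prefers larger mean of dim 0."""
    return 1.0 if features(traj_a)[0] > features(traj_b)[0] else 0.0

def train_predictor(pairs, labels, dim, lr=0.5, epochs=100):
    """Fit a linear score w.x with the Bradley-Terry (logistic) loss."""
    w = [0.0] * dim
    # Feature difference x_a - x_b for every labeled pair.
    diffs = [[a - b for a, b in zip(features(ta), features(tb))]
             for ta, tb in pairs]
    for _ in range(epochs):
        grad = [0.0] * dim
        for d, y in zip(diffs, labels):
            logit = sum(wi * di for wi, di in zip(w, d))
            p = 1.0 / (1.0 + math.exp(-logit))  # P(a preferred over b)
            for i in range(dim):
                grad[i] += (p - y) * d[i] / len(diffs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def predicted_reward(w, traj):
    """Score a trajectory; an RL loop could use this as a reward signal."""
    return sum(wi * xi for wi, xi in zip(w, features(traj)))

# Synthetic trajectories stand in for rollouts collected during RL.
rng = random.Random(0)

def random_traj(steps=20, dim=3):
    offset = [rng.gauss(0, 1) for _ in range(dim)]
    return [[o + rng.gauss(0, 1) for o in offset] for _ in range(steps)]

trajs = [random_traj() for _ in range(40)]
pairs = list(zip(trajs[0::2], trajs[1::2]))
labels = [stand_in_labeler(a, b) for a, b in pairs]
w = train_predictor(pairs, labels, dim=3)
```

In this toy setting the learned weight on dimension 0 comes out positive, so `predicted_reward` ranks trajectories the same way the stand-in labeler does. An online variant would repeat the labeling and fitting steps as new rollouts arrive.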

arXiv information

Authors	Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen
Published	2025-04-21 22:46:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
