Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs

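Summary

Reinforcement learning (RL) struggles to evaluate policy trajectories in complex game tasks because comprehensive and accurate reward functions are hard to design.
This inherent difficulty limits the broader application of RL in game environments characterized by diverse constraints.
Preference-based reinforcement learning (PbRL) offers a pioneering framework that uses human preferences as the key reward signal, avoiding the need for meticulous reward engineering.
However, collecting preference data from human experts is costly and inefficient, especially when the constraints are complex.
To address this, the authors propose LLM4PG, an LLM-enabled automatic preference generation framework that uses large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions for optimizing conditioned policies.
Experiments on tasks with complex language constraints demonstrated the effectiveness of the LLM-enabled reward functions, which accelerated RL convergence and overcame the stagnation caused by slow or absent progress under the original reward structures.
The approach reduces reliance on specialized human knowledge and demonstrates the potential of LLMs to make RL more effective in complex real-world environments.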

Summary (original)

Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks due to the difficulty in designing comprehensive and precise reward functions. This inherent difficulty curtails the broader application of RL within game environments characterized by diverse constraints. Preference-based reinforcement learning (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals, thereby circumventing the need for meticulous reward engineering. However, obtaining preference data from human experts is costly and inefficient, especially under conditions marked by complex constraints. To tackle this challenge, we propose an LLM-enabled automatic preference generation framework named LLM4PG, which harnesses the capabilities of large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions to optimize conditioned policies. Experiments on tasks with complex language constraints demonstrated the effectiveness of our LLM-enabled reward functions, accelerating RL convergence and overcoming stagnation caused by slow or absent progress under original reward structures. This approach mitigates the reliance on specialized human knowledge and demonstrates the potential of LLMs to enhance RL’s effectiveness in complex environments in the wild.
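
The abstract does not say how the reward function is reconstructed from ranked preferences. As a rough illustration only, the sketch below uses a Bradley-Terry preference model with a linear reward, which is a common choice in PbRL but is an assumption here and not necessarily what LLM4PG does. The feature layout (`[progress, constraint_violation]`), the function names, and the hard-coded preference pairs (standing in for rankings an LLM might produce) are all hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def trajectory_return(weights, trajectory):
    """Sum of linear per-step rewards r(s) = w . phi(s) over one trajectory."""
    return sum(sum(w * f for w, f in zip(weights, step)) for step in trajectory)

def preference_probability(weights, traj_a, traj_b):
    """Bradley-Terry probability that traj_a is preferred over traj_b."""
    return sigmoid(trajectory_return(weights, traj_a) - trajectory_return(weights, traj_b))

def fit_reward(pairs, n_features, lr=0.1, epochs=200):
    """Fit reward weights by gradient ascent on the preference log-likelihood.

    pairs: list of (preferred_trajectory, rejected_trajectory).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            p = preference_probability(weights, preferred, rejected)
            for i in range(n_features):
                # d/dw_i log p = (1 - p) * (feature sum of preferred - feature sum of rejected)
                feat_diff = sum(s[i] for s in preferred) - sum(s[i] for s in rejected)
                weights[i] += lr * (1.0 - p) * feat_diff
    return weights

# Hypothetical data: each step is [progress, constraint_violation].
# In an LLM-driven pipeline these labels would come from the LLM's ranking;
# here they are hard-coded for illustration.
pairs = [
    ([[1, 0], [1, 0]], [[1, 1], [0, 1]]),
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
]
weights = fit_reward(pairs, n_features=2)
print(weights)
```

With this toy data the learned weights reward progress and penalize constraint violations, so the fitted reward could replace a sparse original reward during policy optimization.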

arXiv information

Authors	Zichao Shen, Tianchen Zhu, Qingyun Sun, Shiqi Gao, Jianxin Li
Published	2024-07-01 03:32:48+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
