Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

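Summary

Large language models (LLMs) excel at tasks such as question answering and dialogue, but complex tasks that require interaction, such as negotiation and persuasion, demand additional long-horizon reasoning and planning.
Reinforcement learning (RL) fine-tuning can in principle enable such planning, but it suffers from drawbacks that hinder scalability.
In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when LLMs are trained as policies.
Furthermore, the largest LLMs do not expose the APIs needed to train them in this way.
As a result, modern methods for improving LLM reasoning rely on sophisticated prompting mechanisms rather than RL fine-tuning.
To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models.
These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, and plan effectively.
In addition, these value functions are trained over reasoning steps rather than full actions, making them a concise, lightweight module that facilitates decision-making in multi-turn interactions.
We validate the method on tasks requiring interaction, including tool use, social deduction, and dialogue, and demonstrate better performance than both RL fine-tuning and prompting methods while maintaining efficiency and scalability.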

Summary (original)

Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents, that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light-weight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
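As a rough illustration of the idea in the abstract, the sketch below shows one way a goal-conditioned value function could be used to compare candidate reasoning steps against several possible outcomes. It is not the authors' implementation: the names `ValueFn`, `ScoredThought`, `rank_thoughts`, and `toy_value_fn` are hypothetical, the scoring rule (positive-goal values minus negative-goal values) is an assumption made for this example, and the value function is a hand-written lookup table standing in for a model that would be trained with offline goal-conditioned RL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: (state, thought, goal) -> estimated probability in [0, 1]
# that `goal` is eventually reached if the agent follows `thought` from `state`.
ValueFn = Callable[[str, str, str], float]


@dataclass
class ScoredThought:
    thought: str
    score: float
    per_goal: Dict[str, float]


def rank_thoughts(
    state: str,
    thoughts: Sequence[str],
    positive_goals: Sequence[str],
    negative_goals: Sequence[str],
    value_fn: ValueFn,
) -> List[ScoredThought]:
    """Score each candidate reasoning step against every goal, best first."""
    ranked = []
    for thought in thoughts:
        per_goal = {
            goal: value_fn(state, thought, goal)
            for goal in [*positive_goals, *negative_goals]
        }
        # Assumed scoring rule for this sketch: reward likely good outcomes,
        # penalize likely bad ones.
        score = sum(per_goal[g] for g in positive_goals) - sum(
            per_goal[g] for g in negative_goals
        )
        ranked.append(ScoredThought(thought, score, per_goal))
    ranked.sort(key=lambda item: item.score, reverse=True)
    return ranked


def toy_value_fn(state: str, thought: str, goal: str) -> float:
    """Stand-in for a learned value function: a fixed, made-up lookup table."""
    table = {
        ("anchor high", "deal at a good price"): 0.6,
        ("anchor high", "buyer walks away"): 0.3,
        ("concede early", "deal at a good price"): 0.2,
        ("concede early", "buyer walks away"): 0.1,
    }
    return table.get((thought, goal), 0.5)


if __name__ == "__main__":
    ranked = rank_thoughts(
        state="buyer offered 80",
        thoughts=["anchor high", "concede early"],
        positive_goals=["deal at a good price"],
        negative_goals=["buyer walks away"],
        value_fn=toy_value_fn,
    )
    for item in ranked:
        print(f"{item.score:+.2f}  {item.thought}  {item.per_goal}")
```

Because the value function here scores short reasoning steps rather than full utterances, it can stay small and be queried alongside an API-only LLM. The per-goal estimates are kept in `per_goal` so that they could, for example, be passed back to the LLM as text instead of being collapsed into a single score.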

arXiv information

Authors	Joey Hong, Anca Dragan, Sergey Levine
Published	2025-05-23 16:51:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
