Training Agents with Weakly Supervised Feedback from Large Language Models

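Summary

Large language models (LLMs) offer a promising foundation for building agents that tackle complex tasks through iterative interaction with an environment.
Existing methods require these agents either to imitate expert-provided trajectories or to rely on definitive environmental feedback for reinforcement learning, which limits their application to specific scenarios such as games and code generation.
This paper introduces a new training method for LLM-based agents that uses weakly supervised signals from a critic LLM, removing the need for expert trajectories or definitive feedback.
The agents are trained iteratively: in each iteration they first generate trajectories by interacting with the environment.
A critic LLM then selects a subset of good trajectories, which are used to update the agents so that they generate better trajectories in the next iteration.
Extensive tests on the API-Bank dataset show consistent improvement in the agents' capabilities and performance comparable to GPT-4, despite using open-source models with far fewer parameters.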
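
The generate, select, update loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: every name here (`Trajectory`, `train_iteratively`, `ToyAgent`, `toy_critic`) is hypothetical, the critic is a plain scoring function standing in for a critic LLM, the "keep the top fraction" selection rule is an assumption (the abstract says only that the critic selects a subset of good trajectories), and the update step is a numeric stand-in for fine-tuning.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One rollout: the task and the sequence of actions the agent took."""
    task: str
    actions: List[str]


def train_iteratively(
    tasks: List[str],
    generate: Callable[[str], Trajectory],
    critic_score: Callable[[Trajectory], float],
    update: Callable[[List[Trajectory]], None],
    n_iterations: int = 3,
    samples_per_task: int = 4,
    keep_fraction: float = 0.5,  # assumed selection rule, not from the paper
) -> List[List[Trajectory]]:
    """Run the loop and return the trajectories selected in each iteration."""
    history = []
    for _ in range(n_iterations):
        # 1. The agent generates trajectories by interacting with the environment.
        candidates = [generate(t) for t in tasks for _ in range(samples_per_task)]
        # 2. The critic ranks them; only the best-scored subset is kept.
        ranked = sorted(candidates, key=critic_score, reverse=True)
        selected = ranked[: max(1, int(len(ranked) * keep_fraction))]
        # 3. The agent is updated on the selected trajectories only.
        update(selected)
        history.append(selected)
    return history


class ToyAgent:
    """Toy stand-in for an LLM agent: emits 'good' actions with probability p_good."""

    def __init__(self, seed: int = 0) -> None:
        self.p_good = 0.3
        self.rng = random.Random(seed)

    def generate(self, task: str) -> Trajectory:
        actions = ["good" if self.rng.random() < self.p_good else "bad" for _ in range(5)]
        return Trajectory(task, actions)

    def update(self, selected: List[Trajectory]) -> None:
        # Stand-in for fine-tuning: move p_good halfway toward the rate of
        # 'good' actions observed in the selected trajectories.
        total = sum(len(t.actions) for t in selected)
        good = sum(t.actions.count("good") for t in selected)
        self.p_good += 0.5 * (good / total - self.p_good)


def toy_critic(traj: Trajectory) -> float:
    """Toy stand-in for the critic LLM: fraction of 'good' actions."""
    return traj.actions.count("good") / len(traj.actions)


if __name__ == "__main__":
    agent = ToyAgent(seed=0)
    train_iteratively(["book_flight", "check_weather"], agent.generate, toy_critic, agent.update)
    print(f"p_good after training: {agent.p_good:.2f}")
```

The loop never consults expert trajectories or a ground-truth reward; the critic's score is the only supervision signal, which is what makes the feedback weak. In a real setting the critic's judgments would be noisy, so the fraction kept per iteration trades off data volume against label quality.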

Abstract (original)

Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning which limits their application to specific scenarios like gaming or code generation. This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in iterative manner, where they initially generate trajectories through environmental interaction. Subsequently, a critic LLM selects a subset of good trajectories, which are then used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-bank dataset show consistent improvement in our agents’ capabilities and comparable performance to GPT-4, despite using open-source models with much fewer parameters.

arXiv information

Authors	Dihong Gong, Pu Lu, Zelong Wang, Meng Zhou, Xiuqiang He
Published	2024-11-29 08:47:04+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
