SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Summary

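Large language model (LLM) agents must carry out multi-turn interactions in real-world tasks.
However, existing multi-turn RL algorithms for optimizing LLM agents fail to assign credit effectively across turns while leveraging the generalization capabilities of LLMs, and how to develop such algorithms remains unclear.
To study this, we first introduce ColBench, a new benchmark in which an LLM agent interacts with a human collaborator over multiple turns to solve realistic backend programming and frontend design tasks.
Building on this benchmark, we propose SWEET-RL (RL with Step-WisE Evaluation from Training-time information), a novel RL algorithm that uses a carefully designed optimization objective to train a critic model with access to additional training-time information.
The critic provides step-level rewards for improving the policy model.
Our experiments show that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench over other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.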

Abstract (original)

Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.

arXiv information

Authors	Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li
Published	2025-03-19 17:55:08+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
