Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

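Abstract

This survey examines how large language model (LLM)-based agents are evaluated in multi-turn conversational settings.
Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art across a range of publication venues and establishing a solid foundation for our analysis.
Our study offers a structured approach built on two interrelated taxonomies: one that defines what to evaluate and another that explains how to evaluate.
The first taxonomy identifies the key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, and planning and tool integration.
These components ensure that conversational agent performance is assessed holistically and meaningfully.
The second taxonomy focuses on evaluation methodologies.
It categorizes approaches into annotation-based evaluation, automated metrics, hybrid strategies that combine human assessment with quantitative measures, and self-judging methods that use LLMs.
This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogue.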

Abstract (original)

This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines "what to evaluate" and another that explains "how to evaluate". The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.

arXiv information

Authors	Shengyue Guan, Haoyi Xiong, Jindong Wang, Jiang Bian, Bin Zhu, Jian-guang Lou
Published	2025-03-28 14:08:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
