$τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Abstract

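Existing benchmarks for conversational AI agents simulate single-control environments, in which only the AI agent can use tools to interact with the world while the user remains a passive information provider.
This differs from real-world scenarios such as technical support, where users must actively participate in modifying the state of the (shared) world.
To address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) a novel Telecom dual-control domain modeled as a Dec-POMDP, in which both agent and user use tools to act in a shared, dynamic environment that tests both agent coordination and communication; 2) a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity; 3) a reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity; and 4) fine-grained analysis of agent performance through multiple ablations, including separating errors that arise from reasoning from those that arise from communication/coordination.
In particular, our experiments show significant performance drops when agents shift from the no-user setting to dual-control, highlighting the challenges of guiding users.
Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.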

Abstract (original)

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

arXiv information

Authors	Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan
Published	2025-06-09 17:52:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
