TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

Summary

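Task-oriented dialogue (TOD) systems are undergoing a revolution driven by large language models (LLMs), but the evaluation methodologies for these systems have not kept pace with their growing sophistication.
Traditional automatic metrics evaluated earlier modular systems effectively, but they focus only on the dialogue level and cannot detect critical intermediate errors that may occur during user-agent interactions.
This paper introduces TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons.
At the turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance.
In addition, we design TOD Agent Arena, which uses pairwise comparisons to provide a measure of dialogue-level quality.
Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies conversational errors that conventional metrics miss.
Furthermore, TD-EVAL shows better alignment with human judgments than traditional and LLM-based metrics.
These findings show that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both the turn level and the system level with a plug-and-play framework for future research.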

Summary (original)

Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.
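
The two-step structure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names TurnScore, turn_level_summary, flag_turns, and pairwise_win_rates are hypothetical, the numeric score scale is arbitrary, and a plain win rate stands in for whatever aggregation TOD Agent Arena actually applies to its pairwise comparisons, which the abstract does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three turn-level dimensions named in the abstract.
DIMENSIONS = (
    "conversation_cohesion",
    "backend_knowledge_consistency",
    "policy_compliance",
)


@dataclass
class TurnScore:
    """Hypothetical container: one judge score per dimension for one turn."""
    conversation_cohesion: float
    backend_knowledge_consistency: float
    policy_compliance: float


def turn_level_summary(turns):
    """Average each dimension over the turns of a single dialogue."""
    return {d: mean(getattr(t, d) for t in turns) for d in DIMENSIONS}


def flag_turns(turns, threshold):
    """Return indices of turns where any dimension falls below threshold.

    This is the kind of intermediate error a dialogue-level metric
    alone would not localize.
    """
    return [
        i for i, t in enumerate(turns)
        if any(getattr(t, d) < threshold for d in DIMENSIONS)
    ]


def pairwise_win_rates(comparisons):
    """Aggregate (agent_a, agent_b, winner) records into win rates.

    winner is agent_a, agent_b, or None for a tie (counted as half a win
    for each side). Win rate is an assumed stand-in, not the paper's method.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, winner in comparisons:
        if winner not in (a, b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared agents")
        games[a] += 1
        games[b] += 1
        if winner is None:
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {agent: wins[agent] / games[agent] for agent in games}
```

In this sketch, flag_turns([TurnScore(5, 5, 5), TurnScore(4, 2, 5)], threshold=3) points at the second turn only, while the dialogue-level view is built separately from pairwise records such as ("A", "B", "A").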

arXiv information

Authors	Emre Can Acikgoz, Carl Guo, Suvodip Dey, Akul Datta, Takyoung Kim, Gokhan Tur, Dilek Hakkani-Tür
Published	2025-04-28 16:57:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
