Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs

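Summary

Measuring interaction quality is a critical task for improving spoken dialog systems.
Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns or collect dialog-level quality measurements from end users immediately after an interaction.
In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
In DQA, expert annotators evaluate the quality of each dialog as a whole and label dialogs for attributes such as goal completion and user sentiment.
In this contribution, we show the following: (i) although dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality;
(ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and
(iii) the proposed evaluation model generalizes across domains better than the baselines.
Based on these results, we argue that high-quality human-annotated data is an important component of evaluating interaction quality for large, industrial-scale voice assistant platforms.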

Abstract (original)

Measurement of interaction quality is a critical task for the improvement of spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.

arXiv information

Authors	Abishek Komma, Nagesh Panyam Chandrasekarasastry, Timothy Leffel, Anuj Goyal, Angeliki Metallinou, Spyros Matsoukas, Aram Galstyan
Published	2023-06-09 01:17:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
