clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

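Summary

The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulation and robust multi-turn conversational agents.
However, existing work often evaluates these components in isolation, focusing either on a single user simulator or on one specific system design, which limits how well the insights generalise across architectures and configurations.
This work proposes clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions.
clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether they are existing models from the literature or newly developed ones.
It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints.
The authors showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and by integrating three newly proposed dialogue systems into the same evaluation pipeline.
The results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.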

Summary (original)

The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.
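
The idea of benchmarking every combination of user simulator and dialogue system under identical conditions can be pictured with a minimal sketch. Everything in it is an illustrative assumption: the names `UserSimulator`, `DialogueSystem`, `run_dialogue`, and `benchmark`, the toy rule-based components, and the success criterion are hypothetical and are not taken from the clem todd codebase, where LLM-backed components would take the place of these stand-ins.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (not the framework's API):
# a user simulator maps (goal, history) to the next user utterance,
# a dialogue system maps the history to the next system reply.
UserSimulator = Callable[[str, List[str]], str]
DialogueSystem = Callable[[List[str]], str]


def run_dialogue(user: UserSimulator, system: DialogueSystem,
                 goal: str, max_turns: int) -> bool:
    """Run one simulated dialogue; succeed if the system confirms the goal."""
    history: List[str] = []
    for _ in range(max_turns):
        history.append(user(goal, history))
        reply = system(history)
        history.append(reply)
        if reply == f"Booked: {goal}":  # toy success criterion
            return True
    return False


def benchmark(users: Dict[str, UserSimulator],
              systems: Dict[str, DialogueSystem],
              goals: List[str],
              max_turns: int = 5) -> Dict[Tuple[str, str], float]:
    """Score every (user, system) pair on the same goals and turn budget."""
    results: Dict[Tuple[str, str], float] = {}
    for (u_name, user), (s_name, system) in product(users.items(), systems.items()):
        successes = sum(run_dialogue(user, system, g, max_turns) for g in goals)
        results[(u_name, s_name)] = successes / len(goals)
    return results


# Toy stand-ins for LLM-backed components.
def direct_user(goal: str, history: List[str]) -> str:
    return f"I want to {goal}"


def vague_user(goal: str, history: List[str]) -> str:
    # States the goal only after the system has replied once.
    return "I need help" if not history else f"I want to {goal}"


def echo_system(history: List[str]) -> str:
    last = history[-1]
    prefix = "I want to "
    if last.startswith(prefix):
        return "Booked: " + last[len(prefix):]
    return "What would you like?"


def silent_system(history: List[str]) -> str:
    return "Sorry, I cannot help."


USERS = {"direct": direct_user, "vague": vague_user}
SYSTEMS = {"echo": echo_system, "silent": silent_system}
GOALS = ["book a hotel", "book a train"]

if __name__ == "__main__":
    for pair, rate in benchmark(USERS, SYSTEMS, GOALS).items():
        print(pair, rate)
```

Because the goals, the turn budget, and the success metric are fixed inside `benchmark`, swapping in a new simulator or system only means adding one entry to a dictionary, which is the plug-and-play property the abstract describes.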

arXiv information

Authors	Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Published	2025-05-08 17:36:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
