Beyond Words: Multimodal LLM Knows When to Speak

Abstract (original)

While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

arXiv information

Authors	Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin
Published	2025-05-20 17:42:34+00:00
