Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Abstract

Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM that combines speech with visual cues including facial expression, head pose, and gaze. On videoconferencing interactions, it outperforms the state-of-the-art audio-only model (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group them by the duration of silence between turns. This grouping reveals that, by including visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. A detailed ablation study shows that facial expression features contribute the most to model performance. Our working hypothesis is therefore that when interlocutors can see one another, visual cues are vital for turn-taking and must be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
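The evaluation idea in the abstract, reporting hold/shift accuracy per silence-duration group rather than one aggregate figure, can be illustrated with a minimal sketch. This is not the authors' code: the function name `accuracy_by_silence_bin`, the bin edges (250, 500, 1000 ms), and the event tuples are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_silence_bin(events, bin_edges_ms=(250, 500, 1000)):
    """Return hold/shift accuracy per silence-duration bin.

    events: iterable of (silence_ms, true_label, predicted_label).
    bin_edges_ms: hypothetical, ascending bin boundaries in milliseconds.
    Bin 0 is below the first edge; the last bin is open-ended.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for silence_ms, truth, pred in events:
        # Count how many edges the silence meets or exceeds: this is the bin index.
        bin_index = sum(silence_ms >= edge for edge in bin_edges_ms)
        totals[bin_index] += 1
        correct[bin_index] += int(truth == pred)
    return {b: correct[b] / totals[b] for b in sorted(totals)}


# Made-up events for illustration only: (silence in ms, ground truth, prediction).
events = [
    (100, "hold", "hold"),
    (200, "shift", "hold"),
    (300, "shift", "shift"),
    (700, "hold", "hold"),
    (1500, "shift", "shift"),
    (1200, "hold", "shift"),
]

print(accuracy_by_silence_bin(events))
```

Reporting one accuracy per bin makes it visible when a model does well on some silence durations and poorly on others, which a single aggregate number would hide.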

arXiv information

Authors	Sam O’Connor Russell, Naomi Harte
Published	2025-05-27 11:24:38+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
