Towards Online Multi-Modal Social Interaction Understanding

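Summary

Multimodal social interaction understanding (MMSI) is important for human-robot interaction systems.
In real-world scenarios, AI agents need to provide real-time feedback.
However, existing models often rely on both past and future context, which prevents their use in real-world settings.
To close this gap, we propose an online MMSI setting in which the model must solve MMSI tasks using only historical information, such as recorded dialogue and video streams.
To compensate for the missing future context, we develop a new framework named Online-MMSI-VLM that combines two complementary strategies with multi-modal large language models: multi-party conversation forecasting and social-aware visual prompting.
First, to enrich the linguistic context, multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, first anticipating upcoming speaker turns and then generating fine-grained conversational details.
Second, to incorporate visual social cues such as gaze and gesture, social-aware visual prompting highlights the social dynamics in the video with bounding boxes and body keypoints for each person in each frame.
Extensive experiments on three tasks and two datasets show that the method achieves state-of-the-art performance and significantly outperforms baseline models, demonstrating its effectiveness for online MMSI.
The code and pre-trained models will be publicly released at https://github.com/Sampson-Lee/OnlineMMSI.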
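
The online setting and the coarse-to-fine forecasting idea can be illustrated with a small sketch. This is not the authors' implementation: the `Utterance` type, the function names, and the `<to be generated>` placeholder are all hypothetical, and the upcoming speaker turns are simply passed in here rather than predicted by a model. It only shows that the context is cut off at the query time, and that forecast speaker turns (coarse) can be laid out as empty slots for a language model to fill in (fine).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    time: float   # start time in seconds (hypothetical field)
    speaker: str
    text: str


def online_context(dialogue, query_time):
    """Keep only utterances that started at or before query_time."""
    return [u for u in dialogue if u.time <= query_time]


def build_forecast_prompt(history, next_speakers):
    """Coarse-to-fine layout: history first, then one empty slot per
    anticipated speaker turn, to be filled in by a language model."""
    lines = [f"{u.speaker}: {u.text}" for u in history]
    lines += [f"{s}: <to be generated>" for s in next_speakers]
    return "\n".join(lines)


dialogue = [
    Utterance(0.0, "A", "I think B is lying."),
    Utterance(2.5, "B", "Why me?"),
    Utterance(5.0, "C", "I agree with A."),
]
history = online_context(dialogue, query_time=3.0)
print(build_forecast_prompt(history, next_speakers=["C", "A"]))
```

In this toy example the utterance at 5.0 seconds never reaches the prompt, which mirrors the constraint that an online model sees history only.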

Summary (original)

Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders them from applying to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenges of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.
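
As a rough illustration of social-aware visual prompting, the sketch below overlays a bounding box and body keypoints for each person on a frame. It is a toy under stated assumptions, not the paper's pipeline: the frame is a grayscale image stored as nested Python lists, the `people` dictionaries with `box` and `keypoints` entries are a hypothetical format, and the detections are given by hand rather than produced by a detector or pose estimator.

```python
def draw_box(frame, box, value=255):
    """Draw a rectangle outline; box = (x0, y0, x1, y1), inclusive pixels."""
    x0, y0, x1, y1 = box
    for x in range(x0, x1 + 1):
        frame[y0][x] = value
        frame[y1][x] = value
    for y in range(y0, y1 + 1):
        frame[y][x0] = value
        frame[y][x1] = value


def draw_keypoints(frame, keypoints, value=128):
    """Mark each (x, y) body keypoint with a single pixel."""
    for x, y in keypoints:
        frame[y][x] = value


def visual_prompt(frame, people):
    """Return a copy of the frame with every person's box and keypoints drawn."""
    out = [row[:] for row in frame]
    for person in people:
        draw_box(out, person["box"])
        draw_keypoints(out, person["keypoints"])
    return out


frame = [[0] * 8 for _ in range(6)]  # 8x6 blank grayscale frame
people = [{"box": (1, 1, 5, 4), "keypoints": [(3, 2)]}]
for row in visual_prompt(frame, people):
    print(row)
```

Applying the same overlay to every frame of a clip would give the "each person and frame" annotation the abstract describes; a real system would draw on image arrays and feed the result to a vision-language model.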

arXiv information

Authors	Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian
Published	2025-03-25 17:17:19+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
