M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention

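Summary

Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence.
Multi-party interactions involve social signals such as body pose, head pose, and speech, as well as context-specific activities such as acquiring food and taking bites during a meal.
Incorporating every multimodal signal in a multi-party interaction is difficult, and prior work has tended to build task-specific models for predicting social signals.
This work addresses the challenge of predicting multimodal social signals in multi-party settings with a single model.
We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking, which allows multiple social cues to be processed simultaneously across multiple participants and their temporal interactions.
By considering longer horizons of social signals between individuals, this approach better captures social dynamics over time.
We train and evaluate the unified model on the Human-Human Commensality Dataset (HHCD) and demonstrate that using multiple modalities improves bite timing and speaking status prediction.
Source code: https://github.com/AbrarAnwar/masked-social-signals/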

Abstract (original)

Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Multi-party interactions include social signals like body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Incorporating all the multimodal signals in a multi-party interaction is difficult, and past work tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking which allows for the simultaneous processing of multiple social cues across multiple participants and their temporal interactions. This approach better captures social dynamics over time by considering longer horizons of social signals between individuals. We train and evaluate our unified model on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/
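
The abstract names blockwise attention masking but does not spell out the exact pattern. As a rough illustration only, the sketch below builds one plausible block-causal mask in plain Python. It assumes one token per (time block, person, modality) triple, ordered by time block, and lets each token attend to every person and modality in the same or an earlier time block. The function name `blockwise_causal_mask`, the token layout, and the attention rule are all assumptions made for this example, not details taken from the paper or its repository.

```python
from itertools import product


def blockwise_causal_mask(num_blocks, num_people, num_modalities):
    """Build a block-causal attention mask (illustrative assumption only).

    Assumed layout: one token per (time block, person, modality) triple,
    ordered by time block first. mask[i][j] is True when query token i
    may attend to key token j.
    """
    tokens = list(
        product(range(num_blocks), range(num_people), range(num_modalities))
    )
    # A query may see any person and any modality, but only in its own
    # time block or an earlier one (causal at the block level).
    mask = [[key[0] <= query[0] for key in tokens] for query in tokens]
    return tokens, mask


if __name__ == "__main__":
    tokens, mask = blockwise_causal_mask(
        num_blocks=3, num_people=2, num_modalities=2
    )
    # Print one row per query token: 1 = may attend, 0 = masked out.
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))
```

With 3 time blocks, 2 people, and 2 modalities there are 12 tokens, and the printed rows form a staircase of 4-token-wide blocks. A standard token-level causal mask would instead hide tokens of other people or modalities within the same time step, which is the behavior a blockwise mask is meant to avoid.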

arXiv information

Authors	Yiming Tang, Abrar Anwar, Jesse Thomason
Published	2025-01-23 06:42:28+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
