No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration

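Summary

Recognizing who is speaking in a crowded scene is a key challenge for understanding the social interactions taking place within it.
Detecting speaking status from body movement alone opens the way to analyzing social scenes in which individual audio is not available.
Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way.
In the video modality, action recognition traditionally uses a bounding box to localize and segment out the target subject, and then recognizes the action taking place inside the box.
However, cross-contamination, occlusion, and the articulated nature of the human body make this approach difficult in crowded scenes.
Here, articulated body poses are used both for subject localization and in the subsequent speech detection stage.
The authors show that selecting local features around pose keypoints improves generalization performance while also greatly reducing the number of local features considered, which makes the method more efficient.
Using two in-the-wild datasets with different viewpoints of the subjects, they investigate the role of cross-contamination in this effect.
They additionally use acceleration measured by wearable sensors for the same task, and present a multimodal approach that combines both methods.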

Summary (original)

Recognizing who is speaking in a crowded scene is a key challenge towards the understanding of the social interactions going on within. Detecting speaking status from body movement alone opens the door for the analysis of social scenes in which personal audio is not obtainable. Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way. When considering the video modality, in action recognition problems, a bounding box is traditionally used to localize and segment out the target subject, to then recognize the action taking place within it. However, cross-contamination, occlusion, and the articulated nature of the human body make this approach challenging in a crowded scene. Here, we leverage articulated body poses both for subject localization and in the subsequent speech detection stage. We show that the selection of local features around pose keypoints has a positive effect on generalization performance while also significantly reducing the number of local features considered, making for a more efficient method. Using two in-the-wild datasets with different viewpoints of subjects, we investigate the role of cross-contamination in this effect. We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
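
The idea of keeping only local features near pose keypoints can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filter_features_by_keypoints`, the fixed pixel `radius`, and the toy coordinates are all assumptions made for illustration, and the abstract does not say how "around" a keypoint is defined.

```python
import numpy as np

def filter_features_by_keypoints(feature_xy, keypoints_xy, radius):
    """Return a boolean mask of features within `radius` pixels of any keypoint.

    Hypothetical helper: a fixed Euclidean radius is an assumption, not
    something stated in the abstract.
    """
    # Pairwise offsets, shape (n_features, n_keypoints, 2)
    diffs = feature_xy[:, None, :] - keypoints_xy[None, :, :]
    # Euclidean distances, shape (n_features, n_keypoints)
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep a feature if it is close to at least one keypoint
    return (dists <= radius).any(axis=1)

# Toy data (assumed): image locations of four local features and two keypoints
features = np.array([[10.0, 10.0], [12.0, 11.0], [50.0, 50.0], [100.0, 20.0]])
keypoints = np.array([[11.0, 10.0], [98.0, 22.0]])

mask = filter_features_by_keypoints(features, keypoints, radius=5.0)
kept = features[mask]
print(mask)   # which features survive the filter
print(kept)   # the reduced feature set passed to a downstream classifier
```

In this toy case the feature at (50, 50) lies far from both keypoints and is discarded, which is the kind of reduction in feature count the abstract refers to. Features belonging to a neighbouring person would be dropped in the same way if they fall outside the radius of the target subject's keypoints.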

arXiv information

Authors	Jose Vargas-Quiros, Laura Cabrera-Quiros, Hayley Hung
Published	2022-11-01 15:55:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
