Co-Speech Gesture Detection through Multi-Phase Sequence Labeling

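Abstract

Gestures are an integral part of face-to-face communication.
They unfold over time, often following predictable movement phases of preparation, stroke, and retraction.
However, the prevalent approach to automatic gesture detection treats the problem as binary classification, labeling each segment as either containing a gesture or not, and therefore fails to capture the inherently sequential and contextual nature of gestures.
To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification.
Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
The results consistently show that our method significantly outperforms strong baseline models in detecting gesture strokes.
Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection.
These results highlight the framework's capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.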

Abstract (original)

Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework’s capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
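The abstract names Conditional Random Fields as the sequence labeling step but gives no implementation details. As a rough illustration of why a CRF-style decoder differs from labeling each frame independently, the sketch below runs Viterbi decoding over four phase labels. Everything in it is an assumption made for illustration: the label set (`O`, `P`, `S`, `R`), the hand-written `TRANSITIONS` scores, and the per-frame emission scores, which in the paper's setting would presumably come from the Transformer encoder and a trained CRF rather than being fixed by hand.

```python
# Illustrative only: labels and scores are hypothetical, not taken from the paper.
LABELS = ["O", "P", "S", "R"]  # outside, preparation, stroke, retraction
NEG = -1e9  # score that effectively forbids a transition

# Hypothetical transition scores (previous label, next label).
# A trained CRF would learn these values from data.
TRANSITIONS = {
    ("O", "O"): 0.0, ("O", "P"): 0.0, ("O", "S"): -2.0, ("O", "R"): NEG,
    ("P", "P"): 0.0, ("P", "S"): 0.0, ("P", "O"): -2.0, ("P", "R"): NEG,
    ("S", "S"): 0.0, ("S", "R"): 0.0, ("S", "O"): -1.0, ("S", "P"): -2.0,
    ("R", "R"): 0.0, ("R", "O"): 0.0, ("R", "P"): -1.0, ("R", "S"): NEG,
}


def viterbi(emissions):
    """Return the highest-scoring label sequence.

    emissions: one dict per frame, mapping each label to a score.
    """
    if not emissions:
        return []
    # Best score of any path ending in each label at the first frame.
    score = {y: emissions[0][y] for y in LABELS}
    backpointers = []
    for frame in emissions[1:]:
        new_score, pointer = {}, {}
        for y in LABELS:
            # Pick the previous label that maximizes path score into y.
            best_prev = max(LABELS, key=lambda p: score[p] + TRANSITIONS[(p, y)])
            new_score[y] = score[best_prev] + TRANSITIONS[(best_prev, y)] + frame[y]
            pointer[y] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda y: score[y])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path
```

Calling `viterbi` with one score dictionary per frame returns one label per frame. Because the transition scores enter the path score, a frame whose emission slightly favors retraction directly after an outside frame is relabeled rather than producing an implausible phase order.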

arXiv information

Authors	Esam Ghaleb, Ilya Burenko, Marlou Rasenberg, Wim Pouw, Peter Uhrig, Judith Holler, Ivan Toni, Aslı Özyürek, Raquel Fernández
Published	2024-04-23 15:19:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
