Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition

Summary

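Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, allowing nuanced expression through gestures, facial expressions, and body movements.
Although it is central to interaction within the DHH population, significant barriers remain because few hearing people are fluent in sign language.
Closing this gap with automatic sign language recognition (SLR) is still difficult, especially at the dynamic word level, where both temporal and spatial dependencies have to be recognized.
Convolutional Neural Networks (CNNs) have shown potential for SLR, but they are computationally intensive and struggle to capture global temporal dependencies across video sequences.
To address these limitations, the authors propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition.
Transformer models use self-attention to capture global relationships across spatial and temporal dimensions, which makes them suitable for complex gesture recognition tasks.
The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, compared with 65.89% for traditional CNNs.
The study indicates that transformer-based architectures have considerable potential to advance SLR, reduce communication barriers, and promote the inclusion of DHH individuals.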

Abstract (original)

Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, enabling nuanced expression through gestures, facial expressions, and body movements. Despite its critical role in facilitating interaction within the DHH population, significant barriers persist due to the limited fluency in sign language among the hearing population. Overcoming this communication gap through automatic sign language recognition (SLR) remains a challenge, particularly at a dynamic word-level, where temporal and spatial dependencies must be effectively recognized. While Convolutional Neural Networks (CNNs) have shown potential in SLR, they are computationally intensive and have difficulties in capturing global temporal dependencies between video sequences. To address these limitations, we propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition. Transformer models make use of self-attention mechanisms to effectively capture global relationships across spatial and temporal dimensions, which makes them suitable for complex gesture recognition tasks. The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, highlighting its strong performance compared to traditional CNNs with 65.89%. Our study demonstrates that transformer-based architectures have great potential to advance SLR, overcome communication barriers and promote the inclusion of DHH individuals.
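The abstract attributes the transformer's advantage to self-attention over spatial and temporal dimensions. As a rough illustration only, the sketch below flattens a toy clip into one sequence of patch tokens and applies scaled dot-product self-attention, so each patch can attend to patches in every frame. It is not the authors' ViViT or VideoMAE implementation: the toy embeddings, the identity query/key/value projections, and the names `softmax`, `self_attention`, and `frames` are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a flat token sequence.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real model learns separate projections.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token in the clip.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all tokens (the values).
        outputs.append([sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                        for c in range(d)])
    return outputs, weights


# Hypothetical toy clip: 2 frames x 2 patches, each a 3-dim embedding.
frames = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
]
# Flatten space and time into a single sequence of tokens.
tokens = [patch for frame in frames for patch in frame]
outputs, weights = self_attention(tokens)
```

In this toy input, the first patch of frame 1 gives more weight to the similar first patch of frame 2 than to its dissimilar neighbour in the same frame, which is the kind of cross-frame relationship the abstract refers to as a global dependency.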

arXiv information

Authors	Alexander Brettmann, Jakob Grävinghoff, Marlene Rüschoff, Marie Westhues
Published	2025-04-11 06:59:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
