SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

Summary

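Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community.
SLR is a weakly supervised task: glosses are annotated for the video as a whole, which makes it difficult to identify which gloss corresponds to a given video segment.
Recent studies indicate that the main bottleneck in SLR is insufficient training caused by the limited availability of large-scale datasets.
To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language models.
SignVTCL integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone, yielding more robust visual representations.
In addition, SignVTCL includes a visual-textual alignment approach that combines gloss-level and sentence-level alignment, ensuring precise correspondence between visual features and glosses at both the individual-gloss level and the sentence level.
Experiments on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, show that SignVTCL achieves state-of-the-art results compared with previous methods.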
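
The abstract does not specify the loss used for visual-textual alignment. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE (CLIP-style) contrastive loss over matched pairs of visual and text embeddings. The function names, the temperature value, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of similarities."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE over N matched (visual, text) embedding pairs.

    Hypothetical helper: pair i in `visual` is assumed to match pair i
    in `textual`; every other combination is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)
    # sim[i][j]: temperature-scaled cosine similarity of visual i and text j.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    # Cross-entropy with the diagonal as the target, in both directions.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2v = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)

# Toy gloss-level example: two visual segments and two gloss embeddings.
gloss_visual = [[1.0, 0.0], [0.0, 1.0]]
gloss_text = [[0.9, 0.1], [0.2, 0.8]]
aligned = contrastive_loss(gloss_visual, gloss_text)
shuffled = contrastive_loss(gloss_visual, gloss_text[::-1])
```

In this toy setup the correctly paired embeddings give a lower loss than the shuffled pairing. Under the same assumptions, a sentence-level term could reuse the function on one pooled visual embedding and one sentence embedding per video, with the two terms summed into a single training objective.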

Abstract (original)

Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community. SLR is a weakly supervised task where entire videos are annotated with glosses, making it challenging to identify the corresponding gloss within a video segment. Recent studies indicate that the main bottleneck in SLR is the insufficient training caused by the limited availability of large-scale datasets. To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language model. SignVTCL integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone, thereby yielding more robust visual representations. Furthermore, SignVTCL contains a visual-textual alignment approach incorporating gloss-level and sentence-level alignment to ensure precise correspondence between visual features and glosses at the level of individual glosses and sentence. Experimental results conducted on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL achieves state-of-the-art results compared with previous methods.

arXiv information

Authors	Hao Chen, Jiaze Wang, Ziyu Guo, Jinpeng Li, Donghao Zhou, Bian Wu, Chenyong Guan, Guangyong Chen, Pheng-Ann Heng
Published	2024-01-22 11:04:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
