CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

Summary

Title: CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

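Summary: Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Insufficient training, caused by the lack of large-scale sign language datasets, is the main bottleneck for SLR. To make full use of the pretrained knowledge of both the visual and language modules, this work proposes CVT-SLR, a novel contrastive visual-textual transformation method. The work focuses on two points:

1. A variational autoencoder (VAE) that incorporates pretrained contextual knowledge while introducing the complete pretrained language module, implicitly aligning the visual and textual modalities

2. A contrastive cross-modal alignment algorithm that explicitly strengthens the consistency constraints

Experiments on the public PHOENIX-2014 and PHOENIX-2014T datasets show that the proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.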
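
The summary does not describe how the VAE is parameterized. As a rough illustration only, the sketch below shows the two ingredients of a standard Gaussian VAE: the reparameterization trick and the closed-form KL term against a standard normal prior. The function names `reparameterize` and `kl_to_standard_normal`, and the diagonal-Gaussian assumption itself, are not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension.

    Assumption: a diagonal Gaussian posterior, as in a textbook VAE.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)          # fixed seed so the sample is repeatable
mu = [0.5, -1.0, 0.0]           # hypothetical posterior mean
log_var = [0.0, -0.5, 0.2]      # hypothetical posterior log-variance

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The KL term is zero only when the posterior equals the prior (zero mean, unit variance) and grows as the posterior moves away from it, which is what regularizes the latent space in such a model.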

Abstract (original)

Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR. Most SLR works thereby adopt pretrained visual modules and develop two mainstream solutions. The multi-stream architectures extend multi-cue visual features, yielding the current SOTA performances but requiring complex designs and might introduce potential noise. Alternatively, the advanced single-cue SLR frameworks using explicit cross-modal alignment between visual and textual modalities are simple and effective, potentially competitive with the multi-cue framework. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Based on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
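
The abstract names a contrastive cross-modal alignment algorithm but does not give its loss. The sketch below shows one common way such alignment is written, a symmetric InfoNCE loss over a batch of paired visual and textual embeddings. It is an assumption about the general technique, not the paper's exact formulation; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_alignment_loss(visual, textual, temperature=0.1):
    """Symmetric InfoNCE: the i-th visual and i-th textual embedding are a
    positive pair; every other pairing in the batch is a negative."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)
    # Temperature-scaled cosine similarity matrix (rows: visual, cols: text).
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    visual_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_visual = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (visual_to_text + text_to_visual)


visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # text embeddings matching their videos
swapped = [[0.1, 0.9], [0.9, 0.1]]   # same embeddings, wrong pairing

loss_aligned = contrastive_alignment_loss(visual, aligned)
loss_swapped = contrastive_alignment_loss(visual, swapped)
```

With these toy inputs the correctly paired batch yields a much smaller loss than the swapped one, which is the behavior an explicit consistency constraint between modalities relies on.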

arXiv information

Authors	Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, Stan Z. Li
Published	2023-04-12 10:07:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
