MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

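Summary

Recent advances in pre-trained vision transformers have shown that parameter-efficient audio-visual learning is possible without audio pre-training.
However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers.
This paper proposes MA-AVT, a new parameter-efficient audio-visual transformer that applies deep modality alignment to corresponding multimodal semantic features.
Specifically, it introduces joint unimodal and multimodal token learning to align the two modalities using a frozen, modality-shared transformer.
This lets the model learn a separate representation for each modality while also attending to the cross-modal relationships between them.
In addition, unlike prior work that aligns only the coarse features output by unimodal encoders, it introduces blockwise contrastive learning to align hierarchical features, from coarse to fine grained, throughout the encoding phase.
Furthermore, to suppress each modality's background features relative to the foreground-matched audio-visual features, it introduces a robust discriminative foreground mining scheme.
In extensive experiments on the AVE, VGGSound, and CREMA-D benchmark datasets, the method achieves considerable performance improvements over state-of-the-art methods.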

Abstract (original)

Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over SOTA methods.
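The blockwise contrastive alignment mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: a symmetric InfoNCE loss, one pooled embedding per sample per transformer block, a temperature of 0.07, and a plain average over blocks. The function names `info_nce` and `blockwise_contrastive_loss` are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    audio[i] and visual[i] are assumed to come from the same clip and
    form the positive pair; all other combinations act as negatives.
    """
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in visual]
    n = len(a)
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(columns))


def blockwise_contrastive_loss(audio_blocks, visual_blocks, temperature=0.07):
    """Average the contrastive loss over per-block embeddings.

    audio_blocks[k] and visual_blocks[k] hold the (assumed pooled)
    embeddings produced by block k for every sample in the batch.
    """
    losses = [
        info_nce(a, v, temperature)
        for a, v in zip(audio_blocks, visual_blocks)
    ]
    return sum(losses) / len(losses)
```

Applying the loss at every block, rather than only at the encoder output, is what makes the alignment hierarchical in this sketch: early blocks are pushed to agree on low-level features and later blocks on more abstract ones. How the actual method weights or pools the blocks is not stated in the abstract.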

arXiv information

Authors	Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu
Published	2024-06-07 13:35:44+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
