Enhancing Vision-Language Model with Unmasked Token Alignment

Summary

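Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multimodal vision-language representations.
CLIP performs remarkably well, but training it from scratch on noisy web-scale datasets is computationally expensive.
Mask-then-predict pre-training approaches such as Masked Image Modeling (MIM), by contrast, offer efficient self-supervised learning of single-modal representations.
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning its unmasked visual tokens with the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT with the CLIP text encoder.
The pre-trained ViT can therefore be applied directly to zero-shot evaluation without any training on image-text pairs.
Compared with MIM approaches, UTA does not suffer from training-finetuning inconsistency, and it is much more training-efficient because it avoids the extra [MASK] tokens.
Extensive experiments show that UTA can enhance CLIP models and outperform existing MIM methods on a range of uni-modal and multi-modal benchmarks.
Code and models are available at https://github.com/jihaonew/UTA.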
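
The alignment step described in the summary can be illustrated with a small, dependency-free sketch. It assumes a cosine-similarity objective and uniform random masking, neither of which the abstract specifies; the function names (`sample_unmasked_indices`, `alignment_loss`) and the toy vectors are hypothetical and are not taken from the UTA code base. The student is only given the visible positions, while the frozen teacher provides a target for every position.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_unmasked_indices(num_tokens, mask_ratio, rng):
    """Pick the token positions that stay visible (assumed: uniform random masking)."""
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def alignment_loss(student_tokens, teacher_tokens, unmasked_indices):
    """Mean (1 - cosine similarity) over the unmasked positions.

    student_tokens: one vector per unmasked position, in the same order as
        unmasked_indices (stand-in for the trainable ViT outputs).
    teacher_tokens: one vector per position of the full image (stand-in for
        the frozen CLIP vision encoder outputs).
    """
    total = 0.0
    for student_vec, idx in zip(student_tokens, unmasked_indices):
        total += 1.0 - cosine_similarity(student_vec, teacher_tokens[idx])
    return total / len(unmasked_indices)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy teacher tokens: 16 positions, 8 dimensions each.
    teacher = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    keep = sample_unmasked_indices(num_tokens=16, mask_ratio=0.5, rng=rng)
    # A student that reproduces the teacher on the visible tokens has a loss near zero.
    perfect_student = [teacher[i] for i in keep]
    print(alignment_loss(perfect_student, teacher, keep))
```

Because no [MASK] placeholders are fed to the student in this sketch, its input length shrinks with the mask ratio, which is the efficiency argument the summary makes.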

Summary (original)

Contrastive pre-training on image-text pairs, exemplified by CLIP, becomes a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance its vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding using the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
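
The claim that the pre-trained ViT can be used for zero-shot evaluation follows from its image embeddings living in the same space as the CLIP text embeddings. A minimal sketch of CLIP-style zero-shot classification is given below, assuming each class is represented by one text embedding and that the label with the highest cosine similarity wins. The function name `zero_shot_classify` and the toy three-dimensional vectors are illustrative only; real embeddings would come from the aligned ViT and the CLIP text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the best label and all cosine-similarity scores.

    image_embedding: vector for one image (stand-in for the aligned ViT output).
    text_embeddings: dict mapping label -> vector (stand-in for CLIP text
        encoder outputs for prompts describing each class).
    """
    image_unit = l2_normalize(image_embedding)
    scores = {}
    for label, text_vec in text_embeddings.items():
        text_unit = l2_normalize(text_vec)
        scores[label] = sum(a * b for a, b in zip(image_unit, text_unit))
    best_label = max(scores, key=scores.get)
    return best_label, scores


if __name__ == "__main__":
    # Toy embeddings chosen by hand for illustration.
    texts = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
    label, scores = zero_shot_classify([0.9, 0.1, 0.0], texts)
    print(label, scores)
```

No image-text training of the student appears anywhere in this procedure; the text embeddings are simply reused from the existing CLIP model.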

arXiv information

Authors	Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li
Published	2024-06-14 14:29:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
