Fine-Grained Semantically Aligned Vision-Language Pre-Training

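Summary

Large-scale vision-language pre-training has made remarkable progress on a wide range of downstream tasks.
Existing methods mainly model cross-modal alignment either through the similarity of global image and text representations or through advanced cross-modal attention over image and text features.
However, because only global image-text alignment information is available, they cannot explicitly learn the fine-grained semantic alignment between visual regions and textual phrases.
This paper introduces LOUPE, a fine-grained semantically aligned vision-language pre-training framework that learns fine-grained semantic alignment from the new perspective of game-theoretic interactions.
To compute these game-theoretic interactions efficiently, the paper further proposes an uncertainty-aware neural Shapley interaction learning module.
Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks.
Furthermore, without any object-level human annotations or fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding.
More importantly, LOUPE opens a promising new direction: learning fine-grained semantics from large-scale raw image-text pairs.
The repository for this work is at https://github.com/YYJMJC/LOUPE.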

Abstract (original)

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Furthermore, without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs. The repository of this work is at https://github.com/YYJMJC/LOUPE.
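
To make the phrase "game-theoretic interactions" more concrete, here is a minimal sketch of the standard pairwise Shapley interaction index, computed exactly by enumerating coalitions. It is not taken from the LOUPE paper or repository: the function `shapley_interaction`, the player names, and the toy value function `toy_value` are all hypothetical, chosen only to illustrate how an interaction score can single out a matching region-phrase pair.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, i, j, value):
    """Exact pairwise Shapley interaction index of players i and j.

    `value` maps a frozenset of players to a real-valued score.
    The cost is exponential in the number of players.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight |S|! (n - |S| - 2)! / (n - 1)! for coalitions S of this size.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Joint contribution of i and j beyond their separate contributions.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_value(coalition):
    """Hypothetical score: high only when the dog region meets the dog phrase."""
    score = 0.0
    if {"region_dog", "phrase_dog"} <= coalition:
        score += 1.0  # synergy between a region and its matching phrase
    if "region_sky" in coalition:
        score += 0.2  # contribution that does not depend on any phrase
    return score


players = ["region_dog", "region_sky", "phrase_dog"]
matched = shapley_interaction(players, "region_dog", "phrase_dog", toy_value)
unmatched = shapley_interaction(players, "region_sky", "phrase_dog", toy_value)
print(f"matched pair:   {matched:.3f}")
print(f"unmatched pair: {unmatched:.3f}")
```

In this toy game the matched pair receives an interaction of about 1 and the unmatched pair about 0. Exact enumeration visits every subset of the remaining players, so it does not scale to many regions and phrases; this is presumably the cost that the abstract's learned Shapley interaction module is meant to avoid.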

arXiv information

Authors	Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, Siliang Tang
Published	2022-09-19 14:50:15+00:00
arXiv page	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
