UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

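Abstract

Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require a model to reason about the semantics of both the visual world and natural language.
Supervised methods for vision-language tasks have been well studied.
However, solving these tasks in a zero-shot setting has received less attention.
Because Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous work has exploited its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, mainly considering global-level matching (e.g., the whole image or the whole sentence).
However, we find that fine-grained visual and textual information, such as keywords in the sentence and objects in the image, can be fairly informative for understanding semantics.
Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR.
Our experiments show that the framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
Furthermore, our ablation studies confirm the effectiveness and generalizability of the proposed method.
Code will be available at https://github.com/ThreeSR/UniFine.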

Abstract (original)

Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model’s reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for vision-language tasks have been well-studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantics understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine
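
The idea of casting a multiple-choice vision-language task as image-text matching, and then adding fine-grained evidence to the global score, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how global and fine-grained scores are combined, so the weighted sum, the "best region per keyword" rule, and the names `fine_grained_score`, `pick_answer`, and `weight` are all assumptions made for illustration. The vectors stand in for embeddings that an encoder such as CLIP would produce.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fine_grained_score(region_vecs: List[Vector], keyword_vecs: List[Vector]) -> float:
    """Mean, over keywords, of the similarity to the best-matching image region.

    Assumed scoring rule for illustration only; returns 0.0 if either list is empty.
    """
    if not region_vecs or not keyword_vecs:
        return 0.0
    best = [max(cosine(k, r) for r in region_vecs) for k in keyword_vecs]
    return sum(best) / len(best)


def pick_answer(
    image_vec: Vector,
    region_vecs: List[Vector],
    candidates: List[Tuple[Vector, List[Vector]]],
    weight: float = 0.5,
) -> int:
    """Return the index of the candidate text that best matches the image.

    Each candidate is (sentence embedding, keyword embeddings). The score is a
    weighted sum of the global image-sentence similarity and the fine-grained
    region-keyword similarity (hypothetical combination).
    """
    scores = []
    for sentence_vec, keyword_vecs in candidates:
        global_score = cosine(image_vec, sentence_vec)
        local_score = fine_grained_score(region_vecs, keyword_vecs)
        scores.append((1.0 - weight) * global_score + weight * local_score)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: both candidates match the whole image equally well,
# but only the second one has a keyword that matches a detected region.
image = [1.0, 1.0]
regions = [[1.0, 0.0]]
candidates = [
    ([1.0, 1.0], [[0.0, 1.0]]),
    ([1.0, 1.0], [[1.0, 0.0]]),
]
chosen = pick_answer(image, regions, candidates)
```

In this toy setup the global scores tie, so the fine-grained term is what separates the two candidates, which is the kind of situation the abstract argues global-level matching alone handles poorly.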

arXiv information

Authors	Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang
Published	2023-07-03 09:03:12+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
