MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Abstract

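Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have surged in popularity.
However, while LLMs can draw on extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts containing multiple images, which makes them less effective on downstream vision-language tasks.
In this paper, we address this limitation by 1) introducing the vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows a VLM to handle multi-modal inputs efficiently;
2) proposing a novel context scheme that augments the VLM's in-context learning ability;
3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.
Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks such as MME and MMBench.
Our analysis shows that MMICL effectively tackles the challenge of understanding complex multi-modal prompts and exhibits impressive ICL ability.
We further observe that MMICL successfully alleviates language bias, a common issue in VLMs that often leads to hallucination when they face extensive textual context.
The code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC.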

Abstract (original)

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning(MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM’s ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC

arXiv information

Authors	Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
Published	2024-03-20 16:17:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
