PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

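Summary

This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs).
Although the basic architectures and pre-training methods of vision encoders and LLMs have been studied extensively, the architecture and training strategy of vision-language adapters differ considerably across recent work.
The authors thoroughly explore the state-of-the-art perceiver resampler architecture and build a strong baseline on it.
They observe, however, that vision-language alignment with a perceiver resampler converges slowly and scales poorly because it lacks direct supervision.
To address this, they propose PaLM2-VAdapter, which uses a progressively aligned language model as the vision-language adapter.
Compared with the strong perceiver resampler baseline, the method empirically shows faster convergence, higher performance, and stronger scalability.
Extensive experiments on a range of visual question answering (VQA) and captioning tasks, covering both images and videos, show that the model achieves state-of-the-art visual understanding and multi-modal reasoning.
Notably, the method reaches these results with 30~70% fewer parameters than state-of-the-art large vision-language models, a significant gain in efficiency.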

Abstract (original)

This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a thorough exploration of the state-of-the-art perceiver resampler architecture and builds a strong baseline. However, we observe that the vision-language alignment with perceiver resampler exhibits slow convergence and limited scalability with a lack of direct supervision. To address this issue, we propose PaLM2-VAdapter, employing a progressively aligned language model as the vision-language adapter. Compared to the strong baseline with perceiver resampler, our method empirically shows faster convergence, higher performance, and stronger scalability. Extensive experiments across various Visual Question Answering (VQA) and captioning tasks on both images and videos demonstrate that our model exhibits state-of-the-art visual understanding and multi-modal reasoning capabilities. Notably, our method achieves these advancements with 30~70% fewer parameters than the state-of-the-art large vision-language models, marking a significant efficiency improvement.
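The abstract does not describe the internals of either adapter, so the following is only a minimal sketch of the general idea behind a perceiver-resampler-style adapter, not the authors' implementation. It assumes a single cross-attention step with no key/value projections, no stacked layers, and no feed-forward block; the names `resample`, `NUM_QUERIES`, and `DIM`, and the random inputs, are hypothetical. What it illustrates is that a small set of latent queries attends over a variable number of visual features and always returns a fixed number of tokens for the language model.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def resample(visual_features, latent_queries):
    """Cross-attend each latent query over all visual features.

    Returns one output vector per latent query, so the output length
    depends on the number of queries, not on the number of input features.
    Simplification (assumption): features serve as both keys and values.
    """
    dim = len(latent_queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in latent_queries:
        weights = softmax([dot(query, feat) * scale for feat in visual_features])
        mixed = [
            sum(w * feat[i] for w, feat in zip(weights, visual_features))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


random.seed(0)
DIM = 8          # hypothetical feature width
NUM_QUERIES = 4  # hypothetical number of latent queries (learned in practice)

# Stand-ins for frozen vision-encoder outputs (50 patches) and latent queries.
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

tokens = resample(patches, queries)
print(len(tokens), len(tokens[0]))  # 4 8
```

Passing 10 patches instead of 50 still yields 4 tokens, which is why this kind of adapter can hand a fixed-length visual prefix to an LLM regardless of image or video length.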

arXiv information

Authors	Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang
Published	2024-02-16 18:54:47+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
