How Well Can Vision Language Models See Image Details?

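Summary

Large language model-based vision-language models (LLM-based VLMs) have shown strong results on a range of vision-language understanding tasks.
However, it remains unclear how well these VLMs can perceive image detail beyond the semantic level.
This study introduces a pixel value prediction (PVP) task, both to investigate "How well can vision language models see image details?" and to help VLMs perceive more detail.
These models typically consist of a frozen CLIP visual encoder, a large language model, and a connecting module.
After fine-tuning VLMs on the PVP task, the authors find that: 1) existing VLMs struggle to predict precise pixel values when only the connection module and the LLM are fine-tuned;
and 2) prediction precision improves significantly when the vision encoder is also adapted.
The study further shows that including pixel value prediction as one of the VLM pre-training tasks, together with vision encoder adaptation, markedly improves VLM performance on downstream image-language understanding tasks that require detailed image perception, such as referring image segmentation (an average cIoU improvement of +10.19) and video game decision making (average score improvements of +80.34 and +70.54 on two games, respectively).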

Abstract (original)

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore ‘How Well Can Vision Language Models See Image Details?’ and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks requiring detailed image perception, such as referring image segmentation (with an average +10.19 cIoU improvement) and video game decision making (with average score improvements of +80.34 and +70.54 on two games, respectively).
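To make the pixel value prediction task more concrete, the following is a minimal sketch of how a single PVP question-answer pair could be built from an image. The abstract does not specify the prompt wording, the answer format, or the evaluation metric, so the question template, the `[r, g, b]` answer string, and the helper names `make_pvp_sample` and `mean_abs_error` are all illustrative assumptions rather than the authors' implementation.

```python
import random


def make_pvp_sample(image, rng):
    """Build one hypothetical PVP question-answer pair.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The prompt and answer formats below are assumptions for illustration.
    """
    height = len(image)
    width = len(image[0])
    # Pick a random pixel location inside the image.
    x = rng.randrange(width)
    y = rng.randrange(height)
    r, g, b = image[y][x]
    question = f"What is the RGB value of the pixel at (x={x}, y={y})?"
    answer = f"[{r}, {g}, {b}]"
    return {"x": x, "y": y, "question": question, "answer": answer}


def mean_abs_error(predicted, target):
    """Mean absolute per-channel error between two RGB triples
    (an assumed metric, not necessarily the one used in the paper)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Tiny synthetic 4x3 image: red encodes the column, green encodes the row.
image = [[(x * 10, y * 10, 128) for x in range(4)] for y in range(3)]
sample = make_pvp_sample(image, random.Random(0))
print(sample["question"], "->", sample["answer"])
```

In a setup like this, the model would be shown the image and the question and trained to emit the answer string; the error function gives one simple way to score how close a predicted triple is to the true pixel.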

arXiv information

Authors	Chenhui Gou, Abdulwahab Felemban, Faizan Farooq Khan, Deyao Zhu, Jianfei Cai, Hamid Rezatofighi, Mohamed Elhoseiny
Published	2024-08-07 17:59:40+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
