MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Summary

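Multimodal Large Language Models (MLLMs) have made rapid progress on visual recognition tasks in recent years.
Because they may be integrated into many critical applications, it is important to understand the limits of their visual perception.
In this work, we examine whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images.
We observe that their performance is highly sensitive to the size of the visual subject of the question, and we show through an intervention study that this effect is in fact causal.
Next, we study the attention patterns of MLLMs when they answer visual questions and find that they consistently know where to look, even when they give the wrong answer.
Based on these findings, we propose training-free visual intervention methods that leverage the MLLM's own internal knowledge, in the form of attention and gradient maps, to improve its perception of small visual details.
We evaluate the proposed methods on two widely used MLLMs and seven visual question answering benchmarks and show that they significantly improve MLLM accuracy without requiring any training.
Our results clarify the risk of applying MLLMs to visual recognition tasks that involve small details and indicate that visual intervention using the model's internal state is a promising direction for mitigating this risk.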

Abstract (original)

Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs’ accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model’s internal state is a promising direction to mitigate this risk.
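
The abstract does not say how an attention map is turned into a visual intervention. As a rough illustration only, the sketch below assumes one plausible approach: take a 2D patch-level attention map (assumed to have been extracted from the model already), find the window with the most total attention, and convert it to a pixel crop box that could be fed back to the model alongside the original image. The function names, the `frac` parameter, and the cropping strategy itself are hypothetical and are not taken from the paper.

```python
import numpy as np

def best_attention_window(attn, win_h, win_w):
    """Return (row, col) of the top-left corner of the win_h x win_w window
    with the largest total attention in a 2D patch-level attention map."""
    H, W = attn.shape
    if not (1 <= win_h <= H and 1 <= win_w <= W):
        raise ValueError("window must fit inside the attention map")
    # Summed-area table with a zero border, so every window sum is O(1).
    sat = np.zeros((H + 1, W + 1), dtype=np.float64)
    sat[1:, 1:] = attn.cumsum(axis=0).cumsum(axis=1)
    sums = (sat[win_h:, win_w:] - sat[:-win_h, win_w:]
            - sat[win_h:, :-win_w] + sat[:-win_h, :-win_w])
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return int(r), int(c)

def attention_crop_box(attn, image_size, frac=0.5):
    """Map the most-attended window to pixel coordinates.

    attn:       2D array of attention over image patches (assumed given).
    image_size: (width, height) of the image in pixels.
    frac:       hypothetical knob, window side as a fraction of the map side.
    Returns (left, top, right, bottom) in pixels.
    """
    H, W = attn.shape
    img_w, img_h = image_size
    win_h = max(1, int(round(H * frac)))
    win_w = max(1, int(round(W * frac)))
    r, c = best_attention_window(attn, win_h, win_w)
    # Scale patch indices to pixel coordinates.
    left = c * img_w // W
    top = r * img_h // H
    right = (c + win_w) * img_w // W
    bottom = (r + win_h) * img_h // H
    return left, top, right, bottom
```

The returned box uses the (left, top, right, bottom) convention, so it could be passed to an image library's crop routine. Using a fixed window fraction is a simplification; a real method might search over several window sizes or combine attention with gradient maps, as the abstract mentions.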

arXiv information

Authors	Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Published	2025-02-24 18:54:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
