Vision language models are blind: Failing to translate detailed visual features into words

Summary

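Large language models with vision capabilities (VLMs), such as GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, yet they still struggle with low-level vision tasks that are easy for humans.
Specifically, BlindTest is a suite of 7 very simple tasks, including identifying (a) whether two circles overlap;
(b) how many times two lines intersect;
(c) which letter in a word is circled; and
(d) the number of circles in an Olympic-like logo. On this suite, four state-of-the-art VLMs are only 58.07% accurate on average.
Claude 3.5 Sonnet performs best at 77.84% accuracy, far from the expected human accuracy of 100%.
Across different image resolutions and line widths, VLMs, including slow-thinking models, consistently struggle with tasks that require precise spatial information when geometric primitives overlap or are close together.
However, VLMs reach near-100% accuracy when much more space is added to separate the shapes and letters.
Linear probing experiments show that the vision encoders contain enough visual information to solve BlindTest, and that the language models fail to decode this information into correct answers.
Code and data are available at https://vlmsareblind.github.io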
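
Task (a) has an unambiguous geometric answer, which is part of why human accuracy is expected to be 100%. The following is a minimal sketch of how such a ground-truth label could be computed, assuming each circle is given as a center point and a radius. The function name `circles_relation`, the three labels, and the tolerance are illustrative assumptions and are not taken from the BlindTest code.

```python
import math


def circles_relation(c1, r1, c2, r2, tol=1e-9):
    """Classify two circles (treated as filled discs) by center distance.

    Hypothetical helper for illustration only, not part of BlindTest.
    c1, c2: (x, y) centers; r1, r2: radii; tol: tolerance for tangency.
    """
    distance = math.dist(c1, c2)  # Euclidean distance between centers
    if math.isclose(distance, r1 + r2, abs_tol=tol):
        return "touching"     # the discs meet at exactly one point
    if distance < r1 + r2:
        return "overlapping"  # the discs share some area
    return "separate"         # there is a gap between the discs


if __name__ == "__main__":
    # Two unit circles moved progressively farther apart.
    for x in (1.5, 2.0, 3.0):
        print(x, circles_relation((0.0, 0.0), 1.0, (x, 0.0), 1.0))
```

The hard cases described in the summary correspond to center distances close to `r1 + r2`, where the circles nearly touch; widening the gap makes the label easier to read from the image.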

Summary (original)

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, they are still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Claude 3.5 Sonnet performs the best at 77.84% accuracy, far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs including slow-thinking models consistently struggle with those tasks that require precise spatial information when geometric primitives overlap or are close. Yet, VLMs perform at near-100% accuracy when much more space is added to separate shapes and letters. Linear probing experiments show that vision encoders contain sufficient visual information to solve BlindTest and that language models fail to decode this information into correct answers. Code and data are at: https://vlmsareblind.github.io

arXiv information

Authors	Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
Published	2025-03-27 16:16:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
