BLINK: Multimodal Large Language Models Can See but Not Perceive

Summary

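We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations.
Humans can solve most Blink tasks "within a blink" (for example, relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning).
However, these perception-demanding tasks pose significant challenges for current multimodal LLMs, because the tasks themselves resist mediation through natural language.
Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with single or multiple images and visual prompting.
Humans reach 95.70% accuracy on average, yet Blink is surprisingly hard for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve only 51.26% and 45.72% accuracy, just 13.17 and 7.63 percentage points above random guessing.
Such perception abilities have not yet "emerged" in recent multimodal LLMs.
The analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvement.
We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.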
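
The accuracy figures in the abstract can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the reported numbers, assuming the "higher than random guessing" margins are percentage points; the random-guess baseline it derives is inferred from those numbers and is not stated explicitly in the abstract. The names `reported`, `implied_random_baseline`, and `gap_to_human` are illustrative.

```python
# Figures quoted in the abstract, in percent.
HUMAN_ACCURACY = 95.70
reported = {
    "GPT-4V": {"accuracy": 51.26, "margin_over_random": 13.17},
    "Gemini": {"accuracy": 45.72, "margin_over_random": 7.63},
}


def implied_random_baseline(accuracy: float, margin: float) -> float:
    """Random-guess accuracy implied by a model's accuracy and its margin.

    Assumes the margin is expressed in percentage points.
    """
    return round(accuracy - margin, 2)


def gap_to_human(accuracy: float) -> float:
    """Percentage-point gap between a model and average human accuracy."""
    return round(HUMAN_ACCURACY - accuracy, 2)


baselines = {
    name: implied_random_baseline(v["accuracy"], v["margin_over_random"])
    for name, v in reported.items()
}
gaps = {name: gap_to_human(v["accuracy"]) for name, v in reported.items()}

for name in reported:
    print(f"{name}: implied random baseline {baselines[name]:.2f}%, "
          f"gap to human {gaps[name]:.2f} points")
```

Both models imply the same baseline, which suggests the two margins were measured against a single benchmark-wide chance level rather than per-model values.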

Abstract (original)

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans ‘within a blink’ (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not ‘emerged’ yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

arXiv information

Authors	Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
Published	2024-04-18 17:59:54+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google