Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

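Abstract

Large vision and language models have enabled significant advances in fully supervised and zero-shot visual tasks.
These large architectures serve as the baseline for what are currently known as Instruction Tuning Large Vision and Language Models (IT-LVLMs).
IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and visual data.
Despite this versatility, the effectiveness of IT-LVLMs on fundamental computer vision problems remains unclear, primarily because no standardized evaluation benchmark exists.
This paper introduces MERLIM, a Multi-modal Evaluation Benchmark that serves as a scalable test-bed for assessing the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and focuses on detecting cross-modal "hallucination" events in IT-LVLMs.
Our results yield important insights into the performance of state-of-the-art IT-LVLMs, including limitations in identifying fine-grained visual concepts, object hallucinations across tasks, and biases towards the language query.
Our findings also suggest that these models have weak visual grounding but manage to make adequate guesses from global visual patterns or from language biases contained in the LLM component.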

Abstract (original)

Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot visual tasks. These large architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal ‘hallucination’ events in IT-LVLMs. Our results bring important insights on the performance of state-of-the-art IT-LVLMs including limitations at identifying fine-grained visual concepts, object hallucinations across tasks, and biases towards the language query. Our findings also suggest that these models have weak visual grounding, but manage to make adequate guesses from global visual patterns or language biases contained in the LLM component.

arXiv information

Authors	Andrés Villa, Juan Carlos León Alcázar, Alvaro Soto, Bernard Ghanem
Published	2024-06-12 14:59:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
