Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

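Summary

Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts.
Recent work suggests that VLMs can perform multimodal ICL (MM-ICL), but studies show that they often rely on shallow heuristics such as copying or majority voting rather than genuine task understanding.
The authors revisit this assumption by evaluating VLMs under distribution shift, where the support examples come from a different dataset than the query.
Surprisingly, performance often degrades as more demonstrations are added, and models tend to copy answers rather than learn from them.
To investigate further, they propose a new "MM-ICL with Reasoning" pipeline that augments each demonstration with a generated rationale alongside the answer.
They run extensive experiments on both perception- and reasoning-required datasets, using open-source VLMs ranging from 3B to 72B parameters and proprietary models such as Gemini 2.0.
Controlled studies vary the shot count, retrieval method, rationale quality, and distribution.
The results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively use demonstration-level information in the way MM-ICL intends.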

Summary (original)

Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics — such as copying or majority voting — rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.
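As a rough illustration of the idea described in the abstract, the sketch below assembles a few-shot prompt in which each demonstration can carry a rationale next to its answer, and computes a simple "copy rate" (how often a prediction exactly matches one of the demonstration answers). Everything here is an assumption made for illustration: the `Demo` class, the `<image:...>` placeholder, the prompt layout, and the `copy_rate` definition are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """One support example (hypothetical structure, not the paper's)."""
    image_ref: str                   # placeholder for an image path or ID
    question: str
    answer: str
    rationale: Optional[str] = None  # generated rationale, if available


def build_prompt(demos: List[Demo], query_image: str, query_question: str,
                 with_rationale: bool = True) -> str:
    """Concatenate demonstrations, then the query, into one text prompt."""
    parts = []
    for d in demos:
        block = [f"<image:{d.image_ref}>", f"Question: {d.question}"]
        # Only include the rationale when requested and when one exists.
        if with_rationale and d.rationale:
            block.append(f"Rationale: {d.rationale}")
        block.append(f"Answer: {d.answer}")
        parts.append("\n".join(block))
    # The query ends with an open slot for the model to complete.
    tail = [f"<image:{query_image}>", f"Question: {query_question}"]
    tail.append("Rationale:" if with_rationale else "Answer:")
    parts.append("\n".join(tail))
    return "\n\n".join(parts)


def copy_rate(predictions: List[str], demo_answers: List[List[str]]) -> float:
    """Fraction of predictions that exactly match a demonstration answer
    shown for the same query (case- and whitespace-insensitive)."""
    if not predictions:
        return 0.0
    hits = sum(
        pred.strip().lower() in {a.strip().lower() for a in answers}
        for pred, answers in zip(predictions, demo_answers)
    )
    return hits / len(predictions)
```

Under a distribution shift, the `demos` list would be drawn from a different dataset than the query. Comparing accuracy and copy rate with `with_rationale=True` versus `False`, and across shot counts, mirrors the kind of controlled comparison the abstract describes.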

arXiv information

Authors	Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, Zsolt Kira
Published	2025-06-09 16:55:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
