Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

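Summary

Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems.
Among existing approaches, verbalized uncertainty, in which a model expresses its confidence in natural language, has emerged as a lightweight and interpretable solution for large language models (LLMs).
However, its effectiveness in vision-language models (VLMs) has not yet been studied sufficiently.
In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, covering three model categories, four task domains, and three evaluation scenarios.
Our results show that current VLMs are often notably miscalibrated across diverse tasks and settings.
In particular, visual reasoning models (that is, models that think with images) consistently show better calibration, suggesting that modality-specific reasoning is important for reliable uncertainty estimation.
To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings.
Overall, our study highlights the inherent miscalibration of VLMs across modalities.
More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.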

Abstract (original)

Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.
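The abstract does not say which calibration metric the study reports. As an illustration only, expected calibration error (ECE) is one common way to quantify the gap between stated confidence and actual accuracy. The sketch below assumes each verbalized confidence has already been parsed into a number in [0, 1] and each answer has been graded as correct or incorrect. The function name, the equal-width binning, and the default of 10 bins are assumptions for this example, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: sum of confidences, number correct, sample count.
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins

    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence {c} is outside [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hits[b] += int(ok)
        counts[b] += 1

    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / counts[b]
        mean_conf = conf_sum[b] / counts[b]
        ece += (counts[b] / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four answers, each stated with 90% confidence, three of them correct.
    stated = [0.9, 0.9, 0.9, 0.9]
    graded = [True, True, True, False]
    print(round(expected_calibration_error(stated, graded), 4))
```

In this toy input the model claims 90% confidence but is right 75% of the time, so the score reflects a 15-point overconfidence gap. A perfectly calibrated model would score 0.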

arXiv information

Authors	Weihao Xuan, Qingcheng Zeng, Heli Qi, Junjue Wang, Naoto Yokoya
Published	2025-05-26 17:16:36+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
