Variational Visual Question Answering

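Summary

Despite remarkable progress in multimodal models for Visual Question Answering (VQA), major reliability concerns remain because the models are often overconfident and miscalibrated, especially in out-of-distribution (OOD) settings.
Much work has addressed these issues for unimodal models, but little exists for the multimodal case.
Here, we address the unreliability of multimodal models by proposing a Variational VQA approach.
Specifically, instead of fine-tuning vision-language models with AdamW, we employ a recently proposed variational algorithm called IVON, which yields a posterior distribution over model parameters.
Through extensive experiments, we show that our approach improves calibration and abstention without sacrificing the accuracy of AdamW.
For instance, compared to AdamW fine-tuning, we reduce Expected Calibration Error by more than 50% and raise Coverage by 4% vs. SOTA (at a fixed risk of 1%).
Under distribution shift the gains are even larger: an 8% Coverage (@ 1% risk) improvement vs. SOTA when 50% of test cases are OOD.
Overall, we present variational learning as a viable option for enhancing the reliability of multimodal models.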

Summary (original)

Despite remarkable progress in multimodal models for Visual Question Answering (VQA), there remain major reliability concerns because the models can often be overconfident and miscalibrated, especially in out-of-distribution (OOD) settings. Plenty has been done to address such issues for unimodal models, but little work exists for multimodal cases. Here, we address unreliability in multimodal models by proposing a Variational VQA approach. Specifically, instead of fine-tuning vision-language models by using AdamW, we employ a recently proposed variational algorithm called IVON, which yields a posterior distribution over model parameters. Through extensive experiments, we show that our approach improves calibration and abstentions without sacrificing the accuracy of AdamW. For instance, compared to AdamW fine-tuning, we reduce Expected Calibration Error by more than 50% compared to the AdamW baseline and raise Coverage by 4% vs. SOTA (for a fixed risk of 1%). In the presence of distribution shifts, the performance gain is even higher, achieving 8% Coverage (@ 1% risk) improvement vs. SOTA when 50% of test cases are OOD. Overall, we present variational learning as a viable option to enhance the reliability of multimodal models.
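The abstract reports Expected Calibration Error (ECE) and Coverage at a fixed 1% risk without defining them on this page. The following is a minimal sketch of how these two metrics are commonly computed, assuming each prediction comes with a confidence in [0, 1] and a binary correct/incorrect label. The function names, the equal-width binning with 10 bins, and the binary correctness labels are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def coverage_at_risk(
    confidences: Sequence[float], correct: Sequence[bool], max_risk: float = 0.01
) -> float:
    """Largest fraction of questions answered, most confident first,
    such that the error rate among the answered ones stays <= max_risk."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    # Answer in order of decreasing confidence; abstain on the rest.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    errors = 0
    best = 0.0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        if errors / k <= max_risk:
            best = k / n
    return best


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    conf = [0.95, 0.9, 0.8, 0.6]
    ok = [True, True, False, True]
    print("ECE:", expected_calibration_error(conf, ok))
    print("Coverage @ 1% risk:", coverage_at_risk(conf, ok, max_risk=0.01))
```

Lower ECE means confidence tracks accuracy more closely. Higher coverage at a fixed risk means the model can answer more questions before its error rate among answered questions exceeds the threshold. This sketch ignores ties in confidence and the soft VQA accuracy score, both of which a real evaluation would need to handle.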

arXiv information

Authors	Tobias Jan Wieczorek, Nathalie Daun, Mohammad Emtiyaz Khan, Marcus Rohrbach
Published	2025-05-14 17:40:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
