Analysing the Robustness of Vision-Language-Models to Common Corruptions

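Summary

Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content.
However, their robustness to common image corruptions remains under-explored.
In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark.
To systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively, we introduce two new benchmarks, TextVQA-C and GQA-C.
Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning is more sensitive to corruptions such as frost and impulse noise.
We connect these observations to the frequency-domain characteristics of the different corruptions, showing how transformers' inherent bias toward low-frequency processing explains their differing robustness patterns.
Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.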

Summary (original)

Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers’ inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
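The frequency-domain argument in the abstract can be made concrete with a small sketch. The code below is not the authors' analysis and does not use the ImageNet-C implementations; it is a minimal illustration, assuming a synthetic 64x64 test image, a wrap-around box blur as a stand-in for blur corruptions, and salt-and-pepper noise as a stand-in for impulse noise. The helper names (`high_freq_fraction`, `box_blur`, `impulse_noise`) and the 0.25 cutoff are hypothetical choices made for this example.

```python
import numpy as np


def high_freq_fraction(img, cutoff=0.25):
    """Share of (mean-removed) spectral power above a radial frequency cutoff.

    Frequencies are in cycles per pixel, so the Nyquist limit is 0.5.
    The cutoff value is an arbitrary choice for this illustration.
    """
    centered = img - img.mean()
    power = np.abs(np.fft.fft2(centered)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(power[radius > cutoff].sum() / power.sum())


def box_blur(img, k=5):
    """Separable box blur with wrap-around edges (stand-in for blur corruptions)."""
    half = k // 2
    rows = sum(np.roll(img, s, axis=0) for s in range(-half, half + 1)) / k
    return sum(np.roll(rows, s, axis=1) for s in range(-half, half + 1)) / k


def impulse_noise(img, amount=0.05, rng=None):
    """Salt-and-pepper noise: a fraction `amount` of pixels is forced to 0 or 1."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = img.copy()
    mask = rng.random(img.shape)
    out[mask < amount / 2] = 0.0
    out[mask > 1 - amount / 2] = 1.0
    return out


# Synthetic image: a coarse horizontal pattern plus a fine vertical texture.
n = 64
y, x = np.mgrid[0:n, 0:n]
clean = 0.5 + 0.25 * np.sin(2 * np.pi * 3 * x / n) + 0.1 * np.sin(2 * np.pi * 20 * y / n)

hf_clean = high_freq_fraction(clean)
hf_blur = high_freq_fraction(box_blur(clean))
hf_noise = high_freq_fraction(impulse_noise(clean))

print(f"clean:   {hf_clean:.3f}")
print(f"blurred: {hf_blur:.3f}")
print(f"noisy:   {hf_noise:.3f}")
```

On this toy image the blurred version carries a smaller share of high-frequency power than the clean one, while the impulse-noise version carries a larger share. That is the sense in which different corruptions occupy different parts of the spectrum; how a given VLM responds to each is the empirical question the paper addresses.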

arXiv information

Authors	Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, Umair Bin Mansoor
Published	2025-04-21 17:07:18+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
