Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models

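Abstract

Zero-Shot Anomaly Detection (ZSAD) is an emerging anomaly detection (AD) paradigm.
Unlike the traditional unsupervised AD setting, which requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios.
Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks.
However, reasoning about image abnormalities remains underexplored because corresponding datasets and benchmarks are lacking.
To facilitate research in AD and reasoning, the authors establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and an evaluation benchmark, VisA-D&R.
Through investigation with this benchmark, they reveal that current MLLMs such as GPT-4o cannot accurately detect and describe fine-grained anomalous details in images.
To address this, they propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning.
Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens.
Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning.
Extensions to medical and 3D AD are provided for future study.
Link to the project page: https://xujiacong.github.io/Anomaly-OV/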

Abstract (original)

Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/

arXiv information

Authors	Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi
Published	2025-02-11 14:50:43+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
