Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

Summary

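Anomaly detection is essential in many industrial scenarios, such as identifying unusual patterns on production lines and detecting manufacturing defects for quality control.
Existing methods tend to be specialized for individual scenarios and generalize poorly.
This study aims to develop a generic anomaly detection model that can be applied across multiple scenarios.
To that end, the authors customize generic visual-language foundation models, which possess broad knowledge and robust reasoning abilities, to serve as anomaly detectors and reasoners.
Specifically, they introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models.
The approach considers diverse prompt types, including task descriptions, class context, normality rules, and reference images.
In addition, the input representations of the different modalities are unified into a 2D image format, which enables multi-modal anomaly detection and reasoning.
Preliminary studies show that combining visual and language prompts as conditions for customizing the models improves anomaly detection performance.
The customized models can detect anomalies across data modalities such as images, point clouds, and videos.
Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data.
The code is publicly available at https://github.com/xiaohao-xu/customizable-vlm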
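
The four prompt types named above (task description, class context, normality rules, reference images) can be pictured as one structured prompt. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the class name AnomalyPrompt, the method to_messages, the message dictionary layout, and the file paths are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnomalyPrompt:
    """Hypothetical container for the four prompt types named in the abstract."""

    task_description: str
    class_context: str
    normality_rules: List[str] = field(default_factory=list)
    reference_images: List[str] = field(default_factory=list)  # paths to normal samples

    def to_messages(self, query_image: str) -> List[Dict[str, str]]:
        """Assemble language and visual prompts into one ordered message list.

        The dictionary layout is an assumption made for illustration; a real
        visual-language model API will expect its own message format.
        """
        rules = "\n".join(f"- {rule}" for rule in self.normality_rules)
        text = (
            f"Task: {self.task_description}\n"
            f"Object class: {self.class_context}\n"
            f"Normality rules:\n{rules}\n"
            "Compare the query image with the reference images, state whether "
            "it is anomalous, and explain why."
        )
        parts: List[Dict[str, str]] = [{"type": "text", "text": text}]
        # Reference images of normal samples come first, the query image last.
        for path in self.reference_images:
            parts.append({"type": "image", "role": "reference", "path": path})
        parts.append({"type": "image", "role": "query", "path": query_image})
        return parts


prompt = AnomalyPrompt(
    task_description="Decide whether the product in the image has a defect.",
    class_context="metal screw",
    normality_rules=["The thread is continuous.", "The head has no scratches."],
    reference_images=["normal_01.png", "normal_02.png"],
)
messages = prompt.to_messages("query.png")
```

Keeping each prompt type in its own field makes it easy to drop one (for example the reference images) and compare how the model behaves with and without that condition.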

Abstract (original)

Anomaly detection is vital in various industrial scenarios, including the identification of unusual patterns in production lines and the detection of manufacturing defects for quality control. Existing techniques tend to be specialized in individual scenarios and lack generalization capacities. In this study, our objective is to develop a generic anomaly detection model that can be applied in multiple scenarios. To achieve this, we custom-build generic visual language foundation models that possess extensive knowledge and robust reasoning abilities as anomaly detectors and reasoners. Specifically, we introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models. Our approach considers diverse prompt types, including task descriptions, class context, normality rules, and reference images. In addition, we unify the input representation of multi-modality into a 2D image format, enabling multi-modal anomaly detection and reasoning. Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance. The customized models showcase the ability to detect anomalies across different data modalities such as images, point clouds, and videos. Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data. Our code is publicly available at https://github.com/Xiaohao-Xu/Customizable-VLM
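
The abstract states that inputs from different modalities are unified into a 2D image format but does not say how. One plausible reading for video, shown here purely as an assumption, is to tile sampled frames into a single grid image that an image-only model can consume. The function name tile_frames and the nested-list frame representation are hypothetical choices made to keep the sketch dependency-free.

```python
from typing import List

Frame = List[List[int]]  # grayscale image stored as rows of pixel values


def tile_frames(frames: List[Frame], cols: int, pad_value: int = 0) -> Frame:
    """Tile equally sized frames row by row into one 2D grid image.

    Assumed illustration only: the paper's actual conversion of videos or
    point clouds into 2D images may differ.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if cols < 1:
        raise ValueError("cols must be positive")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must have the same size")

    # Pad with blank frames so the last grid row is complete.
    blank = [[pad_value] * width for _ in range(height)]
    padded = frames + [blank] * (-len(frames) % cols)

    grid: Frame = []
    for start in range(0, len(padded), cols):
        row_frames = padded[start:start + cols]
        for y in range(height):
            # Concatenate line y of every frame in this grid row.
            grid.append([pixel for frame in row_frames for pixel in frame[y]])
    return grid


# Three 2x2 frames filled with 1, 2, and 3, tiled two per row.
frames = [[[value] * 2 for _ in range(2)] for value in (1, 2, 3)]
grid = tile_frames(frames, cols=2)
```

With three frames and two columns, the fourth grid cell is filled with the padding value, so the result is a 4x4 image whose reading order (left to right, top to bottom) preserves the temporal order of the frames.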

arXiv information

Authors	Xiaohao Xu, Yunkang Cao, Huaxin Zhang, Nong Sang, Xiaonan Huang
Published	2025-05-16 13:04:38+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
