Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

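Summary

Automatic target recognition (ATR) plays an important role in tasks such as navigation and surveillance, where safety and accuracy are the top priorities.
In extreme use cases such as military applications, unknown terrain, environmental conditions, and novel object categories often make safety and accuracy hard to achieve.
Current object detectors, including open-world detectors, have not been exposed to these new conditions, so they cannot confidently recognize novel objects or operate in unknown environments.
Large Vision-Language Models (LVLMs), however, exhibit emergent properties that let them recognize objects under varying conditions in a zero-shot manner.
Even so, LVLMs struggle to localize objects effectively within a scene.
To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains.
This study compares the performance of various LVLMs at recognizing military vehicles, which are often underrepresented in training datasets.
It also examines how factors such as distance range, modality, and prompting method affect recognition performance, offering insights for developing more reliable ATR systems for novel conditions and classes.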

Abstract (original)

Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged due to the presence of unknown terrains, environmental conditions, and novel object categories. Current object detectors, including open-world detectors, lack the ability to confidently recognize novel objects or operate in unknown environments, as they have not been exposed to these new conditions. However, Large Vision-Language Models (LVLMs) exhibit emergent properties that enable them to recognize objects in varying conditions in a zero-shot manner. Despite this, LVLMs struggle to localize objects effectively within a scene. To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. In this study, we compare the performance of various LVLMs for recognizing military vehicles, which are often underrepresented in training datasets. Additionally, we examine the impact of factors such as distance range, modality, and prompting methods on the recognition performance, providing insights into the development of more reliable ATR systems for novel conditions and classes.
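The two-stage idea in the abstract (a detector localizes, an LVLM recognizes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Box`, `Recognition`, `crop`, and `zero_shot_atr`, the `min_confidence` threshold, the toy nested-list image, and both stub models are hypothetical and exist only to show the control flow. A real system would plug in an actual open-world detector and an actual LVLM in place of the stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass(frozen=True)
class Box:
    """Axis-aligned region; x1 and y1 are exclusive (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class Recognition:
    box: Box
    label: str
    confidence: float


def crop(image: Image, box: Box) -> Image:
    """Cut the boxed region out of the image."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def zero_shot_atr(
    image: Image,
    propose_boxes: Callable[[Image], Sequence[Box]],
    recognize_crop: Callable[[Image, Sequence[str]], Tuple[str, float]],
    candidate_labels: Sequence[str],
    min_confidence: float = 0.5,  # assumed threshold, not from the paper
) -> List[Recognition]:
    """Stage 1 localizes regions; stage 2 labels each cropped region."""
    results = []
    for box in propose_boxes(image):
        label, confidence = recognize_crop(crop(image, box), candidate_labels)
        if confidence >= min_confidence:
            results.append(Recognition(box, label, confidence))
    return results


# Stub stand-ins for the two models (assumptions for illustration only).
def stub_detector(image: Image) -> List[Box]:
    """Pretends to be an open-world detector: returns two fixed boxes."""
    return [Box(0, 0, 2, 2), Box(2, 2, 4, 4)]


def stub_recognizer(patch: Image, labels: Sequence[str]) -> Tuple[str, float]:
    """Pretends to be an LVLM: bright patches get the first label."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return (labels[0], 0.9) if mean > 128 else (labels[-1], 0.3)


# 4x4 toy image whose top-left 2x2 corner is bright.
image = [
    [200, 200, 0, 0],
    [200, 200, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
detections = zero_shot_atr(image, stub_detector, stub_recognizer, ["tank", "truck"])
```

With these stubs, only the bright region passes the confidence threshold, so `detections` holds a single `Recognition` for the top-left box. Passing the two models in as callables keeps the localization and recognition stages independently swappable, which matches the division of labor the abstract describes.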

arXiv information

Authors	Yasiru Ranasinghe, Vibashan VS, James Uplinger, Celso De Melo, Vishal M. Patel
Published	2025-01-13 15:11:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
