A Simple Aerial Detection Baseline of Multimodal Language Models

Abstract

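Multimodal language models (MLMs) based on generative pre-trained Transformers are considered strong candidates for unifying diverse domains and tasks.
MLMs developed for remote sensing (RS) have demonstrated excellent performance on multiple tasks, such as visual question answering and visual grounding.
Beyond visual grounding, which detects the specific objects that correspond to a given instruction, aerial detection, which detects all objects across multiple categories, is also a valuable and challenging task for RS foundation models.
However, existing RS MLMs have not explored aerial detection, because the autoregressive prediction mechanism of MLMs differs substantially from detection outputs.
In this paper, we present LMMRotate, a simple baseline that applies MLMs to aerial detection for the first time.
Specifically, we first introduce a normalization method that converts detection outputs into textual outputs compatible with the MLM framework.
We then propose an evaluation method that ensures a fair comparison between MLMs and conventional object detection models.
We build the baseline by fine-tuning open-source general-purpose MLMs and achieve strong detection performance comparable to conventional detectors.
We hope that this baseline will serve as a reference for future MLM development and enable more comprehensive capabilities for understanding RS images.
Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.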
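
The abstract does not spell out how detection outputs are normalized into text, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes rotated boxes are given as four-corner polygons in pixel coordinates, that coordinates are scaled to integers in the range 0 to 1000, and that boxes are grouped by category in the response string. The function names, the 1000-bin range, and the output layout are all hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed input format: x1, y1, x2, y2, x3, y3, x4, y4 in pixels.
Polygon = Sequence[float]


def normalize_polygon(poly: Polygon, width: int, height: int,
                      bins: int = 1000) -> List[int]:
    """Scale pixel coordinates to integers in [0, bins].

    The bin count is a hypothetical choice; it makes the text output
    independent of the image resolution.
    """
    out = []
    for i, value in enumerate(poly):
        size = width if i % 2 == 0 else height  # even index: x, odd index: y
        scaled = round(value / size * bins)
        out.append(min(max(scaled, 0), bins))   # clamp boxes that cross the border
    return out


def detections_to_text(dets: List[Tuple[str, Polygon]],
                       width: int, height: int) -> str:
    """Serialize (label, polygon) pairs as one text line per category."""
    grouped: Dict[str, List[str]] = {}
    for label, poly in dets:
        coords = normalize_polygon(poly, width, height)
        grouped.setdefault(label, []).append(
            "[" + ", ".join(str(c) for c in coords) + "]")
    # Sort categories so the target string is deterministic.
    return "\n".join(f"{label}: {' '.join(boxes)}"
                     for label, boxes in sorted(grouped.items()))


if __name__ == "__main__":
    example = [
        ("ship", [0, 0, 512, 0, 512, 256, 0, 256]),
        ("plane", [100, 50, 300, 50, 300, 150, 100, 150]),
    ]
    print(detections_to_text(example, width=1024, height=512))
```

A string in this form can serve as the target text for an autoregressive model, and parsing it back into polygons would allow the predictions to be scored with a standard detection metric.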

Abstract (original)

Multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.

arXiv information

Authors	Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang
Published	2025-01-16 18:09:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
