A Simple Aerial Detection Baseline of Multimodal Language Models

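Summary

Multimodal language models (MLMs) based on the generative pre-trained Transformer are regarded as strong candidates for unifying a wide range of domains and tasks.
MLMs developed for remote sensing (RS) have demonstrated excellent performance on multiple tasks, such as visual question answering and visual grounding.
Beyond visual grounding, which detects the specific objects that correspond to a given instruction, aerial detection, which detects all objects across multiple categories, is also a valuable and challenging task for RS foundation models.
However, existing RS MLMs have not explored aerial detection, because the autoregressive prediction mechanism of MLMs differs substantially from detection outputs.
This paper presents LMMRotate, a simple baseline that applies MLMs to aerial detection for the first time.
Specifically, it first introduces a normalization method that converts detection outputs into textual outputs so that they are compatible with the MLM framework.
It then proposes an evaluation method that ensures a fair comparison between MLMs and conventional object detection models.
The baseline is built by fine-tuning open-source general-purpose MLMs and achieves impressive detection performance comparable to that of conventional detectors.
The authors hope the baseline will serve as a reference for future MLM development and enable more comprehensive capabilities for understanding RS images.
Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.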

Abstract (original)

Multimodal language models (MLMs) based on the generative pre-trained Transformer are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.
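
The abstract mentions a normalization step that turns detection outputs into text but does not spell out the format. The sketch below shows one plausible way such a conversion could work, and it should not be read as the LMMRotate implementation. Everything specific here is an assumption made for illustration: the 8-value quadrilateral box representation, the choice of 1000 quantization bins, the "label: [..], [..]" text layout, and the function names normalize_polygon, detections_to_text, and text_to_detections.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical box format: a quadrilateral (x1, y1, ..., x4, y4) in pixels.
Polygon = Tuple[float, ...]


def normalize_polygon(poly: Polygon, width: int, height: int, bins: int = 1000) -> List[int]:
    """Scale pixel coordinates by image size and quantize to integers in [0, bins]."""
    out = []
    for i, value in enumerate(poly):
        size = width if i % 2 == 0 else height  # even index = x, odd index = y
        quantized = int(round(value / size * bins))
        out.append(min(max(quantized, 0), bins))
    return out


def detections_to_text(dets: List[Tuple[str, Polygon]], width: int, height: int,
                       bins: int = 1000) -> str:
    """Serialize detections as one text line per category (assumed layout)."""
    grouped: Dict[str, List[List[int]]] = {}
    for label, poly in dets:
        grouped.setdefault(label, []).append(normalize_polygon(poly, width, height, bins))
    lines = []
    for label, polys in grouped.items():
        boxes = ", ".join("[" + ", ".join(str(v) for v in p) + "]" for p in polys)
        lines.append(f"{label}: {boxes}")
    return "\n".join(lines)


def text_to_detections(text: str, width: int, height: int,
                       bins: int = 1000) -> List[Tuple[str, Polygon]]:
    """Parse generated text back into pixel-space boxes, skipping malformed entries."""
    dets = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        for group in re.findall(r"\[([^\]]*)\]", rest):
            try:
                values = [int(token) for token in group.split(",")]
            except ValueError:
                continue  # non-numeric token: drop this box
            if len(values) != 8:
                continue  # wrong number of coordinates: drop this box
            poly = tuple(
                v / bins * (width if i % 2 == 0 else height)
                for i, v in enumerate(values)
            )
            dets.append((label.strip(), poly))
    return dets
```

Quantizing to a fixed integer range keeps the target text short and independent of image resolution, at the cost of a rounding error of at most half a bin per coordinate. The tolerant parser reflects the fact that an autoregressive model can emit malformed text, which a detection head cannot.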

arXiv information

Authors	Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang
Published	2025-01-23 14:11:52+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
