Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks

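Abstract

Remote sensing scene classification (RSSC) is an important task with various applications in land use and resource management. Unimodal image-based approaches are promising, but they often suffer from limitations such as high intra-class variance and high inter-class similarity. Incorporating textual information can enhance classification by providing additional context and semantic understanding, but manual text annotation is labor-intensive and costly. In this work, we propose a new RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality, without incurring expensive manual annotation costs. To fully exploit the latent complementarity between visual and textual data, we propose a dual cross-attention-based network that fuses these modalities into a unified representation. Extensive experiments with quantitative and qualitative evaluation on five RSSC datasets demonstrate that our framework consistently outperforms baseline models. We also verify the effectiveness of VLM-generated text descriptions by comparing them with human-annotated descriptions. Furthermore, we design a zero-shot classification scenario and show that the learned multimodal representation can be used effectively to classify unseen classes. This work opens new opportunities for leveraging textual information in RSSC tasks and provides a promising multimodal fusion structure, offering insights and inspiration for future research. The code is available at https://github.com/CJR7/MultiAtt-RSSC.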

Abstract (original)

Remote sensing scene classification (RSSC) is a critical task with diverse applications in land use and resource management. While unimodal image-based approaches show promise, they often struggle with limitations such as high intra-class variance and inter-class similarity. Incorporating textual information can enhance classification by providing additional context and semantic understanding, but manual text annotation is labor-intensive and costly. In this work, we propose a novel RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs. To fully leverage the latent complementarities between visual and textual data, we propose a dual cross-attention-based network to fuse these modalities into a unified representation. Extensive experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models. We also verify the effectiveness of VLM-generated text descriptions compared to human-annotated descriptions. Additionally, we design a zero-shot classification scenario to show that the learned multimodal representation can be effectively utilized for unseen class classification. This research opens new opportunities for leveraging textual information in RSSC tasks and provides a promising multimodal fusion structure, offering insights and inspiration for future studies. Code is available at: https://github.com/CJR7/MultiAtt-RSSC
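
The abstract describes the fusion step only at a high level, so the following is a minimal, dependency-free sketch of what a dual cross-attention fusion could look like, not the authors' implementation (see the linked repository for that). It assumes both modalities are already encoded as token sequences of the same width, omits the learned query/key/value projections and multiple heads that a real network would have, and uses hypothetical names (`cross_attention`, `dual_cross_attention_fuse`) and mean pooling plus concatenation as one plausible way to form the unified representation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: tokens of one modality (queries)
    attend to tokens of the other modality (used as keys and values).
    Assumption: no learned projections, single head."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values)


def mean_pool(tokens):
    """Average a token sequence into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def dual_cross_attention_fuse(image_tokens, text_tokens):
    """Run attention in both directions and concatenate the pooled results.
    Assumption: pooling + concatenation is an illustrative fusion choice."""
    image_attends_text = cross_attention(image_tokens, text_tokens)
    text_attends_image = cross_attention(text_tokens, image_tokens)
    return mean_pool(image_attends_text) + mean_pool(text_attends_image)


if __name__ == "__main__":
    random.seed(0)
    # Stand-ins for encoder outputs: 4 image tokens and 6 text tokens, width 8.
    image_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    text_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
    fused = dual_cross_attention_fuse(image_tokens, text_tokens)
    print(len(fused))  # fused vector has width 2 * 8
```

In this sketch the fused vector would feed a classification head; the "dual" aspect is simply that each modality queries the other, so neither the image nor the VLM-generated text is treated as a passive side input.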

arXiv information

Authors	Jinjin Cai, Kexin Meng, Baijian Yang, Gang Shao
Published	2024-12-03 16:24:16+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
