F-ViTA: Foundation Model Guided Visible to Thermal Translation

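Summary

Infrared imaging is essential for scene understanding, especially in low-light and nighttime conditions. However, because capturing infrared images requires specialized equipment, collecting large infrared datasets is costly and labor-intensive. To address this problem, researchers have explored translating visible images into thermal images. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs) and treat the task as a style transfer problem. As a result, these approaches try to learn both the modality distribution shift and the underlying physical principles from limited training data. This paper proposes F-ViTA, a new approach that uses the general world knowledge embedded in foundation models to guide the diffusion process and improve translation. Specifically, an InstructPix2Pix diffusion model is conditioned on zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This lets the model learn meaningful correlations between scene objects and their thermal signatures in infrared images. Extensive experiments on five public datasets show that F-ViTA outperforms state-of-the-art (SOTA) methods. In addition, the model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.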

Summary (original)

Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.

arXiv information

Authors	Jay N. Paranjape, Celso de Melo, Vishal M. Patel
Published	2025-04-03 17:47:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
