The Power of Context: How Multimodality Improves Image Super-Resolution

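Summary

Single-image super-resolution (SISR) remains challenging because it is inherently difficult to recover fine detail and preserve perceptual quality from a low-resolution input.
Existing methods often rely on limited image priors, which leads to suboptimal results.
The authors propose a new approach that uses the rich contextual information available in multiple modalities (depth, segmentation, edges, and text prompts) to learn a powerful generative prior for SISR within a diffusion model framework.
They introduce a flexible network architecture that effectively fuses multimodal information and accommodates an arbitrary number of input modalities without requiring significant changes to the diffusion process.
Crucially, they mitigate the hallucinations often introduced by text prompts by using spatial information from the other modalities to guide regional text-based conditioning.
The guidance strength of each modality can also be controlled independently, which makes it possible to steer the output in different directions, for example increasing bokeh through depth or adjusting object prominence through segmentation.
Extensive experiments show that the model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity.
See the project page at https://mmsr.kfmei.com/.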
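The idea of an independently controllable guidance strength per modality can be illustrated with a small sketch. The abstract does not give the paper's actual formulation, so the code below assumes a generic multi-condition extension of classifier-free guidance, in which each modality contributes its own weighted difference from the unconditional noise prediction. The function name `combine_guidance`, the modality keys, and the toy values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def combine_guidance(
    eps_uncond: List[float],
    eps_cond: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Hypothetical per-modality guidance (an assumption, not the paper's method).

    Computes eps_uncond + sum_m w_m * (eps_m - eps_uncond), where eps_m is the
    noise prediction conditioned on modality m and w_m is its guidance weight.
    """
    out = list(eps_uncond)
    for name, eps_m in eps_cond.items():
        w = weights.get(name, 0.0)  # a missing weight switches that modality off
        for i, (e_m, e_u) in enumerate(zip(eps_m, eps_uncond)):
            out[i] += w * (e_m - e_u)
    return out


# Toy two-element "noise predictions" standing in for real model outputs.
eps_uncond = [0.0, 1.0]
eps_cond = {"depth": [1.0, 1.0], "text": [0.0, 3.0]}

# Raising the depth weight pushes the result further along the depth direction
# without changing how strongly the text prompt is followed.
result = combine_guidance(eps_uncond, eps_cond, {"depth": 2.0, "text": 0.5})
print(result)
```

In this sketch, setting every weight to zero returns the unconditional prediction unchanged, and each weight scales only its own modality's contribution, which is one simple way the independent control described in the summary could be realized.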

Abstract (original)

Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities — including depth, segmentation, edges, and text prompts — to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality’s guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at https://mmsr.kfmei.com/.

arXiv information

Authors	Kangfu Mei, Hossein Talebi, Mojtaba Ardakani, Vishal M. Patel, Peyman Milanfar, Mauricio Delbracio
Published	2025-03-18 17:59:54+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
