Distilling Textual Priors from LLM to Efficient Image Fusion

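Abstract

Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs.
Conventional approaches such as CNNs and GANs are efficient but struggle with low-quality or complex inputs.
Recent text-guided methods draw on the priors of large models to overcome these limitations, but they incur substantial computational overhead in both memory and inference time.
To address this, we propose a new framework that distills large-model priors, removing the need for text guidance at inference while sharply reducing model size.
The framework uses a teacher-student architecture: the teacher network incorporates large-model priors and transfers that knowledge to a smaller student network through a tailored distillation process.
We also introduce a spatial-channel cross-fusion module that strengthens the model's ability to exploit textual priors along both the spatial and channel dimensions.
The method achieves a favorable trade-off between computational efficiency and fusion quality.
The distilled network needs only 10% of the teacher network's parameters and inference time, yet retains 90% of its performance and outperforms existing SOTA methods.
Extensive experiments demonstrate the effectiveness of the approach.
The implementation will be released as an open-source resource.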
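
To make the teacher-student idea concrete, here is a minimal sketch of a distillation objective in plain Python. It is an illustration only: the abstract does not specify the loss, so the combination of an output-matching term and a feature-matching term, the function names `mse` and `distillation_objective`, and the `feat_weight` parameter are all assumptions, not the authors' method.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_objective(student_out, teacher_out,
                           student_feat, teacher_feat,
                           feat_weight=0.5):
    """Hypothetical student training loss (an assumption, not from the paper).

    The teacher is assumed to have produced `teacher_out` and `teacher_feat`
    with text guidance; the student sees only the source images, so matching
    the teacher is how the textual prior would reach it.
    """
    output_term = mse(student_out, teacher_out)      # match the fused image
    feature_term = mse(student_feat, teacher_feat)   # match intermediate features
    return output_term + feat_weight * feature_term


# Toy values standing in for flattened images and feature maps.
loss = distillation_objective([0.0, 0.0], [1.0, 1.0], [2.0], [0.0])
print(loss)
```

Under this assumption the student needs no text input once trained, because the text-derived signal only ever enters through the teacher's targets.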

Abstract (original)

Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent advances in text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead, both in memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework utilizes a teacher-student architecture, where the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model’s ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.

arXiv information

Authors	Ran Zhang, Xuanhua He, Ke Cao, Liu Liu, Li Zhang, Man Zhou, Jie Zhang
Published	2025-04-14 14:47:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
