Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

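Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation.
However, even state-of-the-art MM-DiT models such as FLUX struggle to achieve precise alignment between text prompts and generated content.
We identify two key issues in the MM-DiT attention mechanism: (1) cross-modal attention is suppressed by the token imbalance between the visual and textual modalities, and (2) attention weighting is not timestep-aware. Both hinder alignment.
To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment.
When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead.
We tested TACA on state-of-the-art models such as FLUX and SD3.5 and demonstrated its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships.
Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models.
Our code is publicly available at https://github.com/Vchitect/TACA.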
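
The abstract does not give the exact formulation, so the following is only a minimal sketch of the idea of temperature scaling with a timestep-dependent switch. It assumes joint attention over a concatenated `[text tokens, image tokens]` sequence, scales the logits of image queries toward text keys by a factor `gamma`, and applies that scaling only while the noise level `t` is at or above `t_thresh`. The function name `attention_weights`, the parameters `gamma` and `t_thresh`, their default values, and the convention that `t = 1` means pure noise are all assumptions made for illustration, not details taken from the paper or its repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(logits, num_text, t, gamma=1.5, t_thresh=0.5):
    """Softmax attention with a hypothetical cross-modal temperature.

    logits   : square list of lists; rows are queries, columns are keys,
               both ordered as [text tokens..., image tokens...].
    num_text : number of text tokens at the start of the sequence.
    t        : noise level in [0, 1]; 1 = pure noise (assumed convention).
    gamma    : assumed scaling factor for image-query -> text-key logits.
    t_thresh : assumed noise level at or above which scaling is active.
    """
    scale = gamma if t >= t_thresh else 1.0
    weights = []
    for q_idx, row in enumerate(logits):
        if q_idx >= num_text:
            # Image query: scale only the logits that point at text keys.
            row = [x * scale if k_idx < num_text else x
                   for k_idx, x in enumerate(row)]
        weights.append(softmax(row))
    return weights


# Toy setup: 2 text tokens versus 8 image tokens, all logits equal.
num_text, num_image = 2, 8
n = num_text + num_image
logits = [[1.0] * n for _ in range(n)]


def text_mass(weights):
    """Attention mass that the first image query places on text keys."""
    return sum(weights[num_text][:num_text])


early = text_mass(attention_weights(logits, num_text, t=0.9))  # scaling on
late = text_mass(attention_weights(logits, num_text, t=0.1))   # scaling off
print(early, late)
```

With uniform logits and no scaling, the image query spreads its attention evenly, so the text tokens receive only 2/10 of the mass. This is the token-imbalance effect the abstract describes. With scaling active, the text share rises above that baseline. Note that multiplying logits by `gamma` amplifies positive logits but pushes negative ones further down, which is one reason this sketch should not be read as the authors' implementation.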

Abstract (original)

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at https://github.com/Vchitect/TACA.

arXiv information

Authors	Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Published	2025-06-09 17:54:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
