HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration

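Summary

Diffusion Transformers (DiTs) excel at generative tasks, but their high inference cost makes practical deployment difficult.
Feature caching, which stores redundant computations and retrieves them later, offers a route to acceleration.
Existing learning-based caching methods are adaptive, but they overlook the influence of the prior timestep.
They also suffer from misaligned objectives between training and inference: training aligns the predicted noise, whereas inference aims at high-quality images.
These two discrepancies compromise both performance and efficiency.
To address them, we harmonize training and inference with a new learning-based caching framework called HarmoniCa.
First, Step-Wise Denoising Training (SDT) is incorporated to ensure the continuity of the denoising process, so that prior steps can be leveraged.
In addition, an Image Error Proxy-Guided Objective (IEPO) is applied, which balances image quality against cache utilization through an efficient proxy that approximates the image error.
Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate the framework's superior performance and speedup.
For example, it achieves over $40\%$ latency reduction (i.e., a $2.07\times$ theoretical speedup) together with improved performance on PixArt-$\alpha$.
Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method.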

Summary (original)

Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives–aligned predicted noise vs. high-quality images–between training and inference. These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) is applied to balance image quality against cache utilization through an efficient proxy to approximate the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (i.e., $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method.
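
The idea of feature caching mentioned in the abstract can be illustrated with a toy sketch. This is not the HarmoniCa method: the schedule below is fixed by hand rather than learned, the "blocks" are scalar functions standing in for transformer blocks, and the names `run_step`, `denoise`, `toy_blocks`, and `toy_schedule` are hypothetical. It only shows how a per-timestep, per-block reuse decision trades block evaluations for reuse of features cached at an earlier timestep.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A "block" maps a hidden state to a residual update (toy stand-in for a DiT block).
Block = Callable[[float], float]


def run_step(
    blocks: Sequence[Block],
    x: float,
    reuse: Sequence[bool],
    cache: List[Optional[float]],
) -> Tuple[float, int]:
    """Run all blocks for one timestep; return the new state and the number of blocks computed."""
    computed = 0
    for i, block in enumerate(blocks):
        if reuse[i] and cache[i] is not None:
            # Reuse the residual stored at an earlier timestep instead of recomputing it.
            residual = cache[i]
        else:
            # Compute the block and refresh its cache entry.
            residual = block(x)
            cache[i] = residual
            computed += 1
        x = x + residual
    return x, computed


def denoise(
    blocks: Sequence[Block],
    x: float,
    schedule: Sequence[Sequence[bool]],
) -> Tuple[float, int]:
    """Iterate over timesteps; the cache persists across steps, so earlier steps affect later ones."""
    cache: List[Optional[float]] = [None] * len(blocks)
    total = 0
    for reuse in schedule:
        x, computed = run_step(blocks, x, reuse, cache)
        total += computed
    return x, total


# Hypothetical toy setup: two blocks, four timesteps, hand-written reuse schedule.
toy_blocks = [lambda h: 0.1 * h, lambda h: -0.05 * h]
toy_schedule = [[False, False], [True, False], [False, True], [True, True]]
full_schedule = [[False, False]] * 4

_, cached_calls = denoise(toy_blocks, 1.0, toy_schedule)
_, full_calls = denoise(toy_blocks, 1.0, full_schedule)
print(cached_calls, full_calls)
```

Because the cache is carried from one timestep to the next, the output at a given step depends on which blocks were reused at earlier steps. That dependence is the kind of prior-timestep effect the abstract says existing learning-based caching overlooks.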

arXiv information

Authors	Yushi Huang,Zining Wang,Ruihao Gong,Jing Liu,Xinjie Zhang,Jinyang Guo,Xianglong Liu,Jun Zhang
Published	2025-01-31 14:26:05+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
