RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

Summary

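We show that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, and that their impact grows as model size increases.
This finding highlights a key opportunity to improve the efficiency of large-scale ViTs by focusing on FFN layers.
In this work, we propose a novel channel idle mechanism that enables post-training structural reparameterization of FFN layers, making them more efficient at test time.
Specifically, a set of feature channels in each FFN layer remains idle and bypasses the nonlinear activation function, forming a linear pathway that can be structurally reparameterized during inference.
This mechanism yields a family of ReParameterizable Vision Transformers (RePaViTs), which achieve substantial latency reductions with acceptable sacrifices (and sometimes gains) in accuracy across various ViTs.
The benefits of the method scale consistently with model size: larger models show greater speed improvements and progressively narrower accuracy gaps, or even higher accuracy.
In particular, RePa-ViT-Large and RePa-ViT-Huge achieve 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracy, respectively, under the same training strategy.
To the best of our knowledge, RePaViT is the first to apply structural reparameterization to FFN layers to accelerate ViTs, and we believe it represents a promising direction for efficient ViTs.
Source code is available at https://github.com/Ackesnal/RePaViT.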
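
The channel idle mechanism described above can be illustrated with a small sketch. The code below is not taken from the authors' repository; it is a minimal pure-Python illustration under stated assumptions: a plain two-layer FFN with GELU, no normalization layers and no residual connection (which the actual method may also handle), and hypothetical names (ffn_train, reparameterize, ffn_infer, n_active). The sizes d = 4, hidden = 16 and n_active = 4 are arbitrary. It shows that when some hidden channels skip the activation, their contribution W2_idle (W1_idle x + b1_idle) is linear in x and can be folded into a single d x d matrix plus a merged bias.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def gelu(v):
    """GELU activation (tanh approximation)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def ffn_train(x, W1, b1, W2, b2, n_active):
    """Training-time FFN: only the first n_active hidden channels are activated;
    the remaining (idle) channels pass through unchanged."""
    h = [s + b for s, b in zip(matvec(W1, x), b1)]
    h = [gelu(v) if i < n_active else v for i, v in enumerate(h)]
    return [s + b for s, b in zip(matvec(W2, h), b2)]

def reparameterize(W1, b1, W2, b2, n_active):
    """Fold the idle (linear) channels into one d x d matrix and a merged bias."""
    W1_a, W1_i = W1[:n_active], W1[n_active:]
    b1_a, b1_i = b1[:n_active], b1[n_active:]
    W2_a = [row[:n_active] for row in W2]
    W2_i = [row[n_active:] for row in W2]
    W_lin = matmul(W2_i, W1_i)                       # d x d linear shortcut
    b_out = [b + s for b, s in zip(b2, matvec(W2_i, b1_i))]
    return W1_a, b1_a, W2_a, W_lin, b_out

def ffn_infer(x, W1_a, b1_a, W2_a, W_lin, b_out):
    """Inference-time FFN: a narrower nonlinear branch plus one linear map."""
    h = [gelu(s + b) for s, b in zip(matvec(W1_a, x), b1_a)]
    nonlinear = matvec(W2_a, h)
    linear = matvec(W_lin, x)
    return [n + l + b for n, l, b in zip(nonlinear, linear, b_out)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, hidden, n_active = 4, 16, 4          # illustrative sizes only
W1, b1 = rand_matrix(hidden, d), [random.uniform(-1.0, 1.0) for _ in range(hidden)]
W2, b2 = rand_matrix(d, hidden), [random.uniform(-1.0, 1.0) for _ in range(d)]
x = [random.uniform(-1.0, 1.0) for _ in range(d)]

y_train = ffn_train(x, W1, b1, W2, b2, n_active)
W1_a, b1_a, W2_a, W_lin, b_out = reparameterize(W1, b1, W2, b2, n_active)
y_infer = ffn_infer(x, W1_a, b1_a, W2_a, W_lin, b_out)

# The two forms agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_train, y_infer))
print("max abs difference:", max_diff)
```

In this toy setting the inference-time layer uses two d x n_active projections and one d x d matrix instead of two d x hidden projections, which is where the reduction in computation would come from when many channels are idle.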

Summary (original)

We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact signifying as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs to our best knowledge, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.

arXiv information

Authors	Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang
Published	2025-06-02 06:39:14+00:00

Sources and services used

arxiv.jp, Google
