Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training

Abstract

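Recent Vision Mamba (Vim) models have nearly linear complexity in sequence length, which makes them highly attractive for processing visual data.
However, how best to train them, and how far they can be pushed, has not yet been sufficiently explored.
This paper investigates training strategies for Vim and proposes Stochastic Layer-Wise Shuffle (SLWS), a new regularization method that effectively improves Vim training.
Without any architectural changes, the approach lets the non-hierarchical Vim achieve leading performance on ImageNet-1K among counterparts of a similar type.
The method applies four simple steps at each layer: probability allocation, which assigns a layer-dependent shuffle rate; operation sampling via a Bernoulli trial; sequence shuffling of the input tokens; and order restoration of the outputs.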
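SLWS is distinguished by three principles.
(1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference.
(2) Simple but effective: the four-step process introduces only random permutations and negligible overhead.
(3) Intuitive design: shuffle probabilities grow linearly with layer depth, in line with the hierarchical semantic abstraction of vision models.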
The work underscores the importance of training strategies tailored to Vim models and offers a helpful way to explore their scalability.

Abstract (original)

Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, the training methodologies and their potential are still not sufficiently explored. In this paper, we investigate strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that can effectively improve the Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to get leading performance on ImageNet-1K compared with the similar type counterparts. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: \textit{(1) Plug-and-play:} No architectural modifications are needed, and it is deactivated during inference. \textit{(2) Simple but effective:} The four-step process introduces only random permutations and negligible overhead. \textit{(3) Intuitive design:} Shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability.
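
The four per-layer steps named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the names `shuffle_probability`, `slws_forward`, `max_prob`, and `running_sum` are hypothetical, the exact linear schedule (`max_prob * layer_index / num_layers`) is an assumption, since the abstract only states that the probability grows linearly with depth, and plain lists stand in for token tensors.

```python
import random
from typing import Callable, List, Optional

Tokens = List[float]
Layer = Callable[[Tokens], Tokens]


def shuffle_probability(layer_index: int, num_layers: int, max_prob: float) -> float:
    """Step 1, probability allocation: a rate that grows linearly with depth.

    Assumed form (not taken from the paper): max_prob * layer_index / num_layers,
    with layers numbered 1..num_layers.
    """
    return max_prob * layer_index / num_layers


def slws_forward(
    tokens: Tokens,
    layer: Layer,
    layer_index: int,
    num_layers: int,
    max_prob: float = 0.5,
    training: bool = True,
    rng: Optional[random.Random] = None,
) -> Tokens:
    """Apply one layer with stochastic shuffling of its input sequence."""
    # The regularizer is deactivated during inference.
    if not training:
        return layer(tokens)

    rng = rng if rng is not None else random.Random()
    p = shuffle_probability(layer_index, num_layers, max_prob)

    # Step 2, operation sampling: one Bernoulli trial with success probability p.
    if rng.random() >= p:
        return layer(tokens)

    # Step 3, sequence shuffling: feed the layer a random permutation of the tokens.
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    shuffled = [tokens[i] for i in perm]
    out = layer(shuffled)

    # Step 4, order restoration: put each output back at its token's original index.
    restored = [0.0] * len(out)
    for position, source in enumerate(perm):
        restored[source] = out[position]
    return restored


def running_sum(xs: Tokens) -> Tokens:
    """Toy order-dependent layer, a stand-in for a sequential scan."""
    total, out = 0.0, []
    for x in xs:
        total += x
        out.append(total)
    return out
```

Calling `slws_forward([1.0, 2.0, 3.0, 4.0], running_sum, layer_index=4, num_layers=4, max_prob=1.0)` always shuffles, so the order-dependent toy layer sees the tokens in a random order while the caller still receives one output per original position. With `training=False` the layer is applied unchanged, which is the plug-and-play property the abstract describes.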

arXiv information

Authors	Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang
Published	2025-06-02 08:41:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
