FFN Fusion: Rethinking Sequential Computation in Large Language Models

Abstract

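We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization.
Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those that remain after specific attention layers are removed, can often be parallelized with minimal impact on accuracy.
We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior.
Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks.
Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques such as quantization and pruning.
Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.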

Abstract (Original)

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
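
The abstract does not spell out the mechanics of fusion, so the following is only an illustrative sketch under simplifying assumptions: each FFN is a residual block of the form x + W_down · relu(W_up · x), with no normalization and no gating (real Llama-style FFNs are gated), and all names (`ffn`, `fuse`, `sequential`, `parallel`, `fused`) are hypothetical. It shows two things: running several FFNs in parallel on the same input is only an approximation of running them in sequence, and the parallel form is exactly equivalent to a single wider FFN whose weights are the concatenation of the individual ones.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(x, W_up, W_down):
    """Toy FFN: W_down @ relu(W_up @ x). Assumption: ungated, no bias."""
    return matvec(W_down, relu(matvec(W_up, x)))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sequential(x, layers):
    """Original computation: each FFN reads the output of the previous one."""
    for W_up, W_down in layers:
        y = ffn(x, W_up, W_down)
        x = [a + b for a, b in zip(x, y)]
    return x

def parallel(x, layers):
    """Parallelized approximation: every FFN reads the same input x."""
    out = list(x)
    for W_up, W_down in layers:
        y = ffn(x, W_up, W_down)
        out = [a + b for a, b in zip(out, y)]
    return out

def fuse(layers):
    """Build one wide FFN from several parallel ones.

    Up-projections are stacked row-wise (hidden units concatenated);
    down-projections are concatenated column-wise in the same order.
    """
    W_up = [row for (up, _) in layers for row in up]
    d_model = len(layers[0][1])
    W_down = [[w for (_, down) in layers for w in down[i]] for i in range(d_model)]
    return W_up, W_down

def fused(x, layers):
    """Single wide FFN that computes the same result as parallel()."""
    W_up, W_down = fuse(layers)
    y = ffn(x, W_up, W_down)
    return [a + b for a, b in zip(x, y)]

rng = random.Random(0)
d_model, d_hidden, n_layers = 8, 16, 3
layers = [
    (rand_matrix(d_hidden, d_model, rng), rand_matrix(d_model, d_hidden, rng))
    for _ in range(n_layers)
]
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

# Difference between parallel and fused: floating-point rounding only.
gap_fused = max(abs(a - b) for a, b in zip(parallel(x, layers), fused(x, layers)))
# Difference between sequential and parallel: the approximation error.
gap_seq = max(abs(a - b) for a, b in zip(sequential(x, layers), parallel(x, layers)))
print("parallel vs fused:", gap_fused)
print("sequential vs parallel:", gap_seq)
```

In this toy setup the sequential-versus-parallel gap depends on how weakly each FFN's output perturbs the input of the next one, which is the kind of dependency a method for identifying fusable sequences would need to measure. The fused form replaces several dependent matrix multiplications with one larger, independent pair, which is where a latency reduction could come from.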

arXiv Information

Authors	Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
Published	2025-03-24 17:20:35+00:00
arXiv page	arxiv_id(pdf)

Source and Services Used

arxiv.jp, Google
