Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework

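Summary

This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to improve computational efficiency while preserving model scalability.
Unlike conventional MoE models, which route each token's entire embedding to selected experts, our approach partitions the embedding dimension itself and assigns segments of each token's representation to dedicated experts.
To counter the resulting loss in token representation, we use a pre-expert Transformer layer to recompute attention across tokens and reduce the sequence-length dimension.
We extend the theory by deriving optimal scaling laws that describe a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead.
These formulations yield closed-form and numerically solvable expressions for identifying the optimal number of experts under given architectural and hardware constraints.
As a result, our framework not only provides theoretical bounds on computational efficiency across varying frameworks but also guides practical design choices for scaling large models effectively.
Empirical validation is still pending, but we present a comprehensive experimental roadmap for evaluating the framework's efficiency, scalability, and practicality in future work.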
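
The sectional idea in the summary, where each expert receives one slice of a token's embedding rather than the whole vector, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the paper's implementation: the function names, the equal contiguous split, and the constant-scaling stand-in expert are hypothetical, and the pre-expert Transformer layer and the scaling-law derivation are omitted. The parameter count at the end assumes each expert is a single square linear map, which the abstract does not state.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def split_embedding(token: Sequence[float], num_experts: int) -> List[Vector]:
    """Cut one token embedding into num_experts equal, contiguous segments."""
    d_model = len(token)
    if num_experts <= 0 or d_model % num_experts != 0:
        raise ValueError("d_model must be divisible by num_experts")
    seg = d_model // num_experts
    return [list(token[i * seg:(i + 1) * seg]) for i in range(num_experts)]


def make_scaling_expert(scale: float) -> Expert:
    """Hypothetical stand-in expert: multiplies its segment by a constant."""
    def expert(segment: Sequence[float]) -> Vector:
        return [scale * x for x in segment]
    return expert


def sectional_moe(token: Sequence[float], experts: Sequence[Expert]) -> Vector:
    """Send segment i to expert i, then concatenate the expert outputs."""
    segments = split_embedding(token, len(experts))
    out: Vector = []
    for expert, segment in zip(experts, segments):
        out.extend(expert(segment))
    return out


def linear_expert_params(d_model: int, num_experts: int) -> int:
    """Weight count if every expert were one square linear map (an assumption)."""
    seg = d_model // num_experts
    return num_experts * seg * seg


token = [1.0, 2.0, 3.0, 4.0]
experts = [make_scaling_expert(10.0), make_scaling_expert(-1.0)]
print(sectional_moe(token, experts))   # [10.0, 20.0, -3.0, -4.0]
print(linear_expert_params(8, 1))      # 64
print(linear_expert_params(8, 4))      # 16
```

Under the square-linear assumption, the total weight count falls from d² to d²/E as the number of experts E grows, which gives some intuition for why the expert count trades off against other costs. The actual trade-off, including sequence length and system overhead, is what the paper's scaling laws address.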

Abstract (original)

This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to enhance computational efficiency while preserving model scalability. Unlike conventional MoE models, which route entire token embeddings to selected experts, our approach portions the embedding dimension itself, assigning segments of each token’s representation to dedicated experts. To combat losses in token representation, we utilize a pre-expert transformer layer to recompute attention across tokens and reduce the sequence length dimensionality. We extend our theory by deriving optimal scaling laws that describe a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead. These formulations yield closed-form and numerically solvable expressions for identifying the optimal expert count under given architectural and hardware constraints. As a result, our framework not only provides theoretical bounds for computing efficiency with varying frameworks but also guides practical design choices for scaling large models effectively. While empirical validation is pending, we present a comprehensive experimental roadmap to evaluate the framework’s efficiency, scalability, and practicality in future work.

arXiv information

Author	Soham Sane
Published	2025-03-26 17:33:38+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
