Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation

Summary

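Transformer-based large language models (LLMs) perform impressively on generative tasks, but they are hard to serve in practice because they use expensive, compute-optimized accelerators inefficiently.
Disaggregated serving architectures have been proposed to split the phases of LLM inference, yet the decoding phase remains inefficient.
The cause is that the operators in a transformer-based LLM have different resource demands.
In particular, the attention operator is memory-intensive, and its memory access pattern clashes with the strengths of modern accelerators, especially for long-context requests.
To make LLM decoding more efficient, we introduce model-attention disaggregation.
This approach runs the attention operator on a collection of cheap, memory-optimized devices while still using high-end accelerators for the rest of the model.
With this heterogeneous setup, each component runs on hardware suited to its workload, which maximizes overall performance and cost efficiency.
Our comprehensive analysis and experiments confirm that splitting the attention computation across multiple devices is viable.
They also show that the communication bandwidth required between the heterogeneous devices is manageable with prevalent networking technologies.
To further validate our theory, we develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster.
Experimental results indicate that Lamina provides 16.1% to 90.1% higher estimated throughput than existing solutions of similar cost.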
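
The statement that the attention operator is memory-intensive during decoding can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is not taken from the paper: the cost model, the function names, and the example sizes are all assumptions chosen for illustration. It assumes 16-bit values, standard multi-head attention, one new token per request per step, and it ignores softmax and caching effects.

```python
# Rough roofline-style estimate; every formula here is a simplifying assumption.

def linear_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for one dense layer applied to `batch` tokens.

    The weight matrix is read once and shared by the whole batch.
    """
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


def attention_decode_intensity(batch, context_len, hidden, bytes_per_elem=2):
    """FLOPs per byte for decode-time attention over a cached context.

    Each request reads its own K and V cache, so nothing is shared across
    the batch (assumes no prefix sharing between requests).
    """
    # Q.K^T and the weighted sum over V: 2 * context_len * hidden FLOPs each.
    flops = batch * 4 * context_len * hidden
    # K and V caches, plus the query vector in and the output vector out.
    elems = batch * (2 * context_len * hidden + 2 * hidden)
    return flops / (bytes_per_elem * elems)


if __name__ == "__main__":
    for batch in (1, 16, 64, 256):
        lin = linear_intensity(batch, 4096, 4096)
        att = attention_decode_intensity(batch, 4096, 4096)
        print(f"batch={batch:4d}  linear={lin:8.2f}  attention={att:6.3f}")
```

Under these assumptions the intensity of the dense layers grows roughly in proportion to the batch size, while decode-time attention stays near one FLOP per byte no matter how large the batch is, which is one way to see why the two kinds of operator might suit different hardware.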

Summary (original)

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests. To enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster. Experimental results indicate that Lamina can provide 16.1 ~ 90.1% higher estimated throughput than existing solutions with similar costs.
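
To get a feel for the communication bandwidth that the abstract mentions, one can estimate the traffic between a model worker and an attention worker. The sketch below is a back-of-the-envelope model under stated assumptions, not the paper's analysis: it assumes that, for every token and every layer, the query, key, and value vectors are sent to the attention device and one output vector comes back, with 16-bit values. The function names, the `kv_hidden` parameter, and the example sizes are hypothetical.

```python
# Back-of-the-envelope traffic estimate; the transfer pattern is an assumption.

def transfer_bytes_per_step(batch, n_layers, hidden, kv_hidden=None, bytes_per_elem=2):
    """Bytes exchanged in one decoding step, both directions combined.

    kv_hidden is the width of the key and value vectors; it defaults to
    `hidden` (multi-head attention) and would be smaller with grouped-query
    attention.
    """
    if kv_hidden is None:
        kv_hidden = hidden
    # Sent to the attention device: q, k, v. Received back: attention output.
    elems_per_token_layer = hidden + 2 * kv_hidden + hidden
    return batch * n_layers * elems_per_token_layer * bytes_per_elem


def required_gbit_per_s(bytes_per_step, steps_per_second):
    """Aggregate link rate in Gbit/s needed to sustain the given step rate."""
    return bytes_per_step * steps_per_second * 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model and load, chosen only to exercise the formulas.
    per_step = transfer_bytes_per_step(batch=256, n_layers=80, hidden=8192)
    print(f"{per_step / 1e9:.2f} GB per decoding step")
    print(f"{required_gbit_per_s(per_step, steps_per_second=20):.0f} Gbit/s at 20 steps/s")
```

The estimate adds the two directions together, so a full-duplex link would carry less than this figure in each direction. Whether a particular network suffices depends on the model size, batch size, and step rate plugged in.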

arXiv information

Authors	Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu
Published	2025-04-10 14:56:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
