APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

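Summary

Long-context inference is crucial for advancing large language model (LLM) applications, but its prefill speed remains a significant bottleneck.
Current approaches, such as sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency.
This hinders scaling inputs to longer sequences and processing long-context queries in a timely manner.
To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously.
APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling faster inference while maintaining task performance.
We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations.
APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation.
The implementation and experiment code of APB are available at https://github.com/thunlp/APB.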

Summary (original)

While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.
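As a rough illustration of why passing only essential key-value pairs can reduce compute, the toy sketch below counts how many KV pairs each host would attend to during prefill. It is not the APB implementation: the function name `kv_pairs_per_host`, the parameter `keep_per_block`, and the assumption that every host forwards a fixed number of KV pairs to all later hosts are hypothetical choices made only for this example.

```python
def kv_pairs_per_host(seq_len, num_hosts, keep_per_block):
    """Count the KV pairs each host attends to in a toy compressed-block scheme.

    Assumptions (illustrative only, not taken from the paper):
      - the sequence is split evenly into one block per host;
      - each host forwards `keep_per_block` KV pairs to every later host.
    """
    if seq_len % num_hosts != 0:
        raise ValueError("seq_len must be divisible by num_hosts in this sketch")
    block = seq_len // num_hosts
    keep = min(keep_per_block, block)

    compressed = []  # local block + compressed blocks from earlier hosts
    full = []        # local block + every earlier block, uncompressed
    for host in range(num_hosts):
        compressed.append(block + host * keep)
        full.append(block * (host + 1))
    return compressed, full


compressed, full = kv_pairs_per_host(seq_len=8192, num_hosts=4, keep_per_block=256)
print(compressed)  # [2048, 2304, 2560, 2816]
print(full)        # [2048, 4096, 6144, 8192]
```

Under these assumptions, the last host attends to 2816 KV pairs instead of 8192, and the per-host workload stays much closer to balanced. The real speedups reported above depend on the actual selection mechanism and kernels, which this count does not model.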

arXiv information

Authors	Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
Published	2025-02-17 17:59:56+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
