Cost-Optimal Grouped-Query Attention for Long-Context LLMs

Summary

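Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, one that requires maximizing a model's language capabilities while minimizing training and deployment costs.
Existing work has mainly characterized the complex relationships among model performance, parameter size, and data size, and has searched for the optimal compute allocation for training LLMs.
However, it overlooks the impact of context length and attention head configuration (the number of query heads and key-value heads in grouped-query attention) on training and inference.
In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost.
We then extend existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs for both training and inference.
Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve lower loss while incurring lower computational and memory costs.
Our findings offer valuable insights for developing practical LLMs, especially in long-context processing scenarios.
We will publicly release our code and data.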
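
To make the cost argument concrete, here is a minimal sketch of how attention-related memory and compute grow with context length and head counts in grouped-query attention. It relies on standard back-of-the-envelope approximations rather than the paper's own cost model. The function names and the two configurations are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    Assumes every layer caches n_kv_heads key vectors and n_kv_heads value
    vectors of size head_dim per token (bytes_per_elem=2 means 16-bit values).
    """
    # Factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def attention_flops_per_token(n_layers, n_q_heads, head_dim, context_len):
    """Approximate FLOPs for one new token to attend to context_len cached tokens.

    Per query head, the score computation (q . K^T) and the weighted sum over V
    each cost about 2 * head_dim * context_len FLOPs (one multiply and one add).
    Projections and feed-forward layers are deliberately left out.
    """
    return n_layers * n_q_heads * 4 * head_dim * context_len


if __name__ == "__main__":
    context_len = 131072  # 128K tokens, an assumed long-context setting

    # Hypothetical configurations, NOT taken from the paper.
    configs = {
        "more heads": dict(n_layers=32, n_q_heads=32, n_kv_heads=8, head_dim=128),
        "fewer heads": dict(n_layers=32, n_q_heads=8, n_kv_heads=1, head_dim=128),
    }
    for name, c in configs.items():
        mem = kv_cache_bytes(c["n_layers"], c["n_kv_heads"], c["head_dim"], context_len)
        flops = attention_flops_per_token(
            c["n_layers"], c["n_q_heads"], c["head_dim"], context_len
        )
        print(f"{name}: KV cache {mem / 2**30:.1f} GiB, "
              f"attention {flops / 1e9:.1f} GFLOPs per token")
```

Under these approximations both quantities scale linearly with context length and with the number of heads, and neither depends on the size of the feed-forward layers. That is consistent with the summary's claim that, for sufficiently long sequences, cutting heads can lower costs even when the parameter count is increased elsewhere.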

Summary (original)

Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.

arXiv information

Authors	Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun
Published	2025-03-12 17:50:42+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
