Attamba: Attending To Multi-Token States

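Summary

When predicting the next token in a sequence, a vanilla transformer computes attention over all previous tokens, so compute scales quadratically with sequence length.
State-space models (SSMs) improve efficiency by compressing the entire token sequence into a fixed-dimensional representation, while other architectures achieve sub-quadratic complexity through low-rank projections or sparse attention patterns over the sequence.
This paper introduces Attamba, a new architecture that uses state-space models to compress chunks of tokens and applies attention over these compressed key-value representations.
Replacing the key and value projections in a transformer with SSMs is found to improve model quality and enable flexible token chunking, giving 24% better perplexity than a transformer with a similar KV-cache and attention footprint, and roughly 4 times smaller KV-cache and attention FLOPs for a 5% perplexity trade-off.
Attamba can perform attention over chunked sequences of variable length, enabling a smooth transition between quadratic and linear scaling and offering adaptable efficiency gains.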
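
The idea of compressing chunks of tokens and then attending over the compressed states can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`compress_chunks`, `attend`, `chunked_attention`) are hypothetical, a fixed scalar-decay recurrence stands in for a learned SSM, and resetting the state at each chunk boundary is a simplifying assumption.

```python
import math

def compress_chunks(xs, chunk_size, decay):
    """Toy stand-in for an SSM (assumption, not the paper's layer):
    run h = decay * h + x inside each chunk and keep only the final
    state of every chunk as that chunk's representation."""
    states = []
    for start in range(0, len(xs), chunk_size):
        h = [0.0] * len(xs[0])  # state is reset per chunk for simplicity
        for x in xs[start:start + chunk_size]:
            h = [decay * hi + xi for hi, xi in zip(h, x)]
        states.append(h)
    return states

def attend(query, keys, values):
    """Plain softmax attention of a single query over keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out

def chunked_attention(query, xs, chunk_size):
    """Attend over one compressed key/value pair per chunk.
    Different decays play the role of separate key and value SSMs."""
    keys = compress_chunks(xs, chunk_size, decay=0.9)
    values = compress_chunks(xs, chunk_size, decay=0.5)
    return attend(query, keys, values), len(keys)

# Eight 2-dimensional "tokens" (made-up values for illustration).
tokens = [[float(i), 1.0] for i in range(8)]
output, num_kv = chunked_attention([1.0, 0.0], tokens, chunk_size=4)
print(num_kv)  # number of key/value entries the query attends over
```

With `chunk_size=1` the toy recurrence returns each token unchanged, so the sketch reduces to ordinary attention over all tokens; larger chunks shrink the number of key/value entries the query has to score.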

Abstract (original)

When predicting the next token in a sequence, vanilla transformers compute attention over all previous tokens, resulting in quadratic scaling of compute with sequence length. State-space models compress the entire sequence of tokens into a fixed-dimensional representation to improve efficiency, while other architectures achieve sub-quadratic complexity via low-rank projections or sparse attention patterns over the sequence. In this paper, we introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens and applies attention on these compressed key-value representations. We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking, resulting in 24% improved perplexity with transformer of similar KV-Cache and attention footprint, and ~4 times smaller KV-Cache and Attention FLOPs for 5% perplexity trade-off. Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling, offering adaptable efficiency gains.
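
The claimed transition between quadratic and linear scaling can be made concrete with simple counting. The sketch below is a back-of-the-envelope illustration, not a measurement from the paper: the helper names are hypothetical, the sequence length of 4096 and chunk size of 4 are assumed values, and causal masking is ignored.

```python
import math

def kv_entries(seq_len, chunk_size):
    """Number of cached key/value entries when every chunk of
    `chunk_size` tokens is compressed to a single entry."""
    return math.ceil(seq_len / chunk_size)

def attention_scores(seq_len, chunk_size):
    """Query-key scores computed if each of the `seq_len` queries is
    scored against every cached entry (causal masking ignored)."""
    return seq_len * kv_entries(seq_len, chunk_size)

for chunk_size in (1, 4, 4096):
    print(chunk_size, kv_entries(4096, chunk_size), attention_scores(4096, chunk_size))
```

A chunk size of 1 gives the usual quadratic count, a chunk size equal to the sequence length gives a count linear in the sequence length, and intermediate chunk sizes divide both the cache size and the score count by the chunk size.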

arXiv information

Authors	Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah
Published	2024-11-26 18:52:06+00:00
arXiv page	arxiv_id(pdf)

Provided by, services used

arxiv.jp, Google
