MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

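Summary

Considerable effort has gone into reducing the computational complexity of large language models by making transformer models more efficient, for example through linear attention and flash-attention.
Nevertheless, model size and the corresponding computational complexity keep being scaled up in pursuit of higher performance.
In this work, we introduce MemoryFormer, a new transformer architecture that significantly reduces computational complexity (FLOPs) from a new perspective.
It eliminates nearly all of the transformer's computation except for what the multi-head attention operation requires.
This is made possible by an alternative method of feature transformation that replaces the linear projections of the fully-connected layers.
Specifically, we first build a group of in-memory lookup tables that store a large number of discrete vectors, which take the place of the weight matrix used in a linear projection.
A hash algorithm then dynamically retrieves a correlated subset of those vectors based on the input embedding.
The retrieved vectors are combined to form the output embedding, which serves as an estimate of the result of the matrix multiplication in a fully-connected layer.
Compared with performing a matrix multiplication, retrieving data blocks from memory is a much cheaper operation that requires very little computation.
We train MemoryFormer from scratch and run extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.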
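
The lookup-and-combine step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which hash function is used, how the input is partitioned, or how the retrieved vectors are combined, so the sign-bit hash, the fixed-size chunks, the summation, and the names `sign_hash` and `LookupLayer` are all hypothetical choices made here. The table entries are random rather than trained.

```python
import random


def sign_hash(chunk):
    """Map a chunk of floats to a table index from its sign pattern.

    Assumption: a sign-bit hash is used purely for illustration; the
    abstract does not specify the hash algorithm.
    """
    index = 0
    for value in chunk:
        index = (index << 1) | (1 if value > 0 else 0)
    return index


class LookupLayer:
    """Hypothetical stand-in for a fully-connected layer built from lookup tables."""

    def __init__(self, d_in, d_out, chunk_size, seed=0):
        if d_in % chunk_size != 0:
            raise ValueError("d_in must be divisible by chunk_size")
        rng = random.Random(seed)
        self.chunk_size = chunk_size
        self.num_chunks = d_in // chunk_size
        self.d_out = d_out
        # One table per input chunk; each table stores 2**chunk_size
        # vectors of length d_out (random here, learned in a real model).
        self.tables = [
            [[rng.gauss(0.0, 1.0) for _ in range(d_out)]
             for _ in range(2 ** chunk_size)]
            for _ in range(self.num_chunks)
        ]

    def forward(self, x):
        """Hash each chunk of x, fetch one vector per table, and sum them."""
        out = [0.0] * self.d_out
        for k in range(self.num_chunks):
            chunk = x[k * self.chunk_size:(k + 1) * self.chunk_size]
            row = self.tables[k][sign_hash(chunk)]
            for j in range(self.d_out):
                out[j] += row[j]
        return out


layer = LookupLayer(d_in=8, d_out=4, chunk_size=4)
x = [0.5, -1.2, 0.3, 0.9, -0.1, -0.7, 2.0, -0.4]
y = layer.forward(x)
```

In this toy setting a dense 8-to-4 projection would need 8 × 4 = 32 multiply-accumulates per input, whereas the sketch performs no multiplications at all: it reads one row from each of the two tables and adds them. The price is memory, since each table holds 2**chunk_size rows.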

Summary (original)

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors combined together will form the output embedding, which provides an estimation of the result of matrix multiplication operation in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computations. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.

arXiv information

Authors	Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang
Published	2024-11-20 02:41:53+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
