TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

Summary

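The reasoning abilities of large language models (LLMs) can be improved by structurally denoising their weights, but existing methods mainly focus on denoising the feed-forward network (FFN) of the transformer block and cannot efficiently utilise the multi-head attention (MHA) block, which is the core of the transformer architecture.
To address this problem, we propose a new, intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition.
This enables both higher-dimensional structured denoising and compression of the MHA weights by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads.
We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights.
Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only denoising techniques to achieve further improvements in LLM reasoning performance.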
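The multi-head tensorisation and Tucker decomposition summarised above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each head's slices of W_Q, W_K, W_V and (transposed) W_O are stacked into a tensor of shape d_model × d_head × 4 × h, and it uses a truncated higher-order SVD (HOSVD) as the Tucker solver, compressing the first three modes while keeping the head mode at full size so that all heads share the same factor matrices. The function names, the tensor layout and the ranks (16, 4, 3) are hypothetical choices for illustration.

```python
import numpy as np


def tensorise_mha(w_q, w_k, w_v, w_o, num_heads):
    """Stack the per-head slices of W_Q, W_K, W_V, W_O into a 4th-order tensor.

    Assumed (hypothetical) layout: all four weights are (d_model, d_model);
    the columns of W_Q/W_K/W_V and the rows of W_O are split evenly across heads.
    Returns a tensor of shape (d_model, d_head, 4, num_heads).
    """
    d_model = w_q.shape[0]
    d_head = d_model // num_heads
    # Transpose W_O so that each head's slice is also (d_model, d_head).
    mats = np.stack([w_q, w_k, w_v, w_o.T])              # (4, d_model, d_model)
    mats = mats.reshape(4, d_model, num_heads, d_head)   # split columns per head
    return mats.transpose(1, 3, 0, 2)                    # (d_model, d_head, 4, h)


def detensorise(t):
    """Inverse of tensorise_mha: fold the tensor back into (w_q, w_k, w_v, w_o)."""
    d_model, d_head, _, num_heads = t.shape
    mats = t.transpose(2, 0, 3, 1).reshape(4, d_model, num_heads * d_head)
    return mats[0], mats[1], mats[2], mats[3].T


def unfold(t, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)


def mode_dot(t, m, mode):
    """Mode-n product of tensor t with matrix m of shape (new_dim, t.shape[mode])."""
    moved = np.moveaxis(t, mode, 0)
    out = m @ moved.reshape(moved.shape[0], -1)
    return np.moveaxis(out.reshape((m.shape[0],) + moved.shape[1:]), 0, mode)


def shared_tucker(t, ranks):
    """Truncated HOSVD on modes 0-2; mode 3 (heads) is left uncompressed.

    Every head keeps its own core slice core[..., i], but all heads share the
    same three factor matrices, i.e. a common low-dimensional subspace.
    """
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u[:, :r])          # leading r left singular vectors
    core = t
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)  # project onto the shared subspaces
    return core, factors


def reconstruct(core, factors):
    """Map the core back to full size: the low-multilinear-rank (denoised) tensor."""
    t = core
    for mode, u in enumerate(factors):
        t = mode_dot(t, u, mode)
    return t


# Usage with random stand-in weights (a real model's MHA weights would go here).
rng = np.random.default_rng(0)
d_model, num_heads = 64, 8
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
t = tensorise_mha(w_q, w_k, w_v, w_o, num_heads)
core, factors = shared_tucker(t, ranks=(16, 4, 3))
denoised = reconstruct(core, factors)
w_q_new, w_k_new, w_v_new, w_o_new = detensorise(denoised)
print(t.shape, core.shape, w_q_new.shape)
print("relative error:", np.linalg.norm(denoised - t) / np.linalg.norm(t))
```

Keeping one core slice per head while sharing the three factor matrices is what ties the heads to a common subspace; truncating the ranks is what discards the weaker components. On random weights, as here, the relative error only confirms the mechanics and says nothing about reasoning performance.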

Summary (original)

The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
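To see how large compression rates in the MHA weights can arise, the parameter count of a Tucker representation can be compared with that of the four dense attention matrices. The sketch below assumes a hypothetical layout (a d_model × d_head × 4 × h tensor, head mode kept at full size, one core slice per head, three shared factor matrices) and ignores biases. The function names are illustrative, the dimensions are BERT-base-like, and the ranks are arbitrary examples, not the settings reported in the paper.

```python
def tucker_param_count(d_model, d_head, num_heads, ranks):
    """Parameters in a Tucker model of a (d_model, d_head, 4, num_heads) tensor.

    Hypothetical layout: modes 0-2 are truncated to `ranks`; the head mode is
    kept at full size, so there is one core slice per head plus three shared
    factor matrices.
    """
    r1, r2, r3 = ranks
    core = r1 * r2 * r3 * num_heads
    shared_factors = d_model * r1 + d_head * r2 + 4 * r3
    return core + shared_factors


def mha_compression_ratio(d_model, num_heads, ranks):
    """Dense W_Q, W_K, W_V, W_O parameters (biases ignored) over Tucker parameters."""
    d_head = d_model // num_heads
    dense = 4 * d_model * d_model
    return dense / tucker_param_count(d_model, d_head, num_heads, ranks)


# BERT-base-like dimensions (d_model=768, 12 heads); the ranks are illustrative.
for ranks in [(64, 16, 4), (16, 8, 2), (8, 4, 1)]:
    ratio = mha_compression_ratio(768, 12, ranks)
    print(f"ranks={ranks}: {ratio:.1f}x fewer MHA parameters")
```

The count makes the trade-off explicit: at full rank the factor matrices are pure overhead, while at small ranks the total is dominated by the d_model × r1 factor and the per-head core, so shrinking r1 drives most of the compression.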

arXiv information

Authors	Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Published	2025-05-15 12:42:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
