Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers

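Summary

Recently, several architectures have been proposed that improve the efficiency of Transformer language models by redesigning the self-attention block for linear-cost inference (LCI). A notable approach in this area is the State-Space Machines (SSMs) architecture, which has shown performance on par with self-attention Transformers on language modeling tasks. However, such an architectural change requires fully pretraining the weights from scratch, which imposes a large cost on researchers and practitioners who want to use the new architectures. In more traditional work on linear attention, a swap-and-finetune framework has been proposed for approximating full attention with linear attention. Motivated by this approach, we propose Cross-Architecture Transfer Learning (XATL), in which the weights of components shared between LCI and self-attention-based Transformers, such as layernorms, MLPs, and input/output embeddings, are transferred directly from already pretrained model parameters to the new architecture. We evaluate the method across model sizes and alternative attention architectures and show that, within the same compute budget, XATL reduces training time by up to 2.5x and converges to a better minimum, yielding a model up to 2.6% stronger on LM benchmarks.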

Summary (original)

Recently, multiple architectures have been proposed to improve the efficiency of Transformer Language Models by changing the design of the self-attention block to have linear-cost inference (LCI). A notable approach in this realm is the State-Space Machines (SSMs) architecture, which showed on-par performance with self-attention transformers on language modeling tasks. However, such an architectural change requires a full pretraining of the weights from scratch, which incurs a huge cost to researchers and practitioners who want to use the new architectures. In more traditional linear attention work, it has been proposed to approximate full attention with linear attention through a swap-and-finetune framework. Motivated by this approach, we propose Cross-Architecture Transfer Learning (XATL), in which the weights of the shared components between LCI and self-attention-based transformers, such as layernorms, MLPs, and input/output embeddings, are directly transferred to the new architecture from already pre-trained model parameters. We experimented on the efficacy of the method across varying sizes and alternative attention architectures and show that XATL significantly reduces the training time by up to 2.5x and converges to a better minimum, with an up to 2.6% stronger model on the LM benchmarks within the same compute budget.
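
The weight transfer described in the abstract can be pictured with a small sketch. This is not the paper's implementation: the parameter names, the ".attn." naming convention, and the helper transfer_shared_weights are all hypothetical, and plain Python lists stand in for tensors. It only illustrates the idea of copying parameters for shared components (embeddings, layernorms, MLPs) by name while leaving the replaced token-mixing block at its fresh initialization.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def transfer_shared_weights(
    pretrained: Weights,
    target: Weights,
    skip_substring: str = ".attn.",  # assumed naming convention for attention params
) -> List[str]:
    """Copy parameters that exist in both models with the same size.

    Parameters whose name contains `skip_substring` are never copied, so the
    new token-mixing block keeps its own initialization.
    Returns the names of the parameters that were copied.
    """
    copied = []
    for name, value in pretrained.items():
        if skip_substring in name:
            continue  # attention weights have no counterpart in the new block
        if name in target and len(target[name]) == len(value):
            target[name] = list(value)  # copy, so the two models do not share storage
            copied.append(name)
    return copied


# Toy "pretrained" self-attention model (names are illustrative assumptions).
pretrained = {
    "embed.weight": [0.1, 0.2],
    "layers.0.ln.weight": [1.0, 1.0],
    "layers.0.attn.qkv.weight": [0.3, 0.4],
    "layers.0.mlp.fc.weight": [0.5, 0.6],
    "lm_head.weight": [0.7, 0.8],
}

# Toy target model where attention is replaced by a hypothetical LCI block.
target = {
    "embed.weight": [0.0, 0.0],
    "layers.0.ln.weight": [0.0, 0.0],
    "layers.0.ssm.mix.weight": [0.0, 0.0],
    "layers.0.mlp.fc.weight": [0.0, 0.0],
    "lm_head.weight": [0.0, 0.0],
}

copied = transfer_shared_weights(pretrained, target)
print(sorted(copied))
```

In this toy setup only the hypothetical "layers.0.ssm.mix.weight" entry stays at its initial value; everything else is taken from the pretrained dictionary. Whether transferred weights are then frozen or trained further is a separate design decision that this sketch does not address.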

arXiv information

Author	Sehyun Choi
Published	2024-04-03 12:27:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
