Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures

Summary

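The growing complexity of transformer models in artificial intelligence increases their computational cost, memory usage, and energy consumption.
Hardware acceleration addresses the resulting challenges by designing processors and accelerators tailored to transformer models, supporting their computational hotspots with high efficiency.
However, memory bandwidth can limit the gains achievable with hardware accelerators.
Against this background, this paper proposes a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, that effectively minimizes off-chip data access.
This arrangement is particularly beneficial for end-to-end transformer model inference, where most of the computation is based on general matrix multiplication (GEMM) operations.
In addition, we address the overhead of non-GEMM operations in transformer models within the scope of this memory data arrangement.
Our study explores the implementation and effectiveness of the proposed accelerator-driven data arrangement approach in both single-core and multi-core systems.
Our evaluation demonstrates that our approach can achieve up to a 2.8x speedup when running inference with state-of-the-art transformers.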

Summary (original)

The increasing complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption. Hardware acceleration tackles the ensuing challenges by designing processors and accelerators tailored for transformer models, supporting their computation hotspots with high efficiency. However, memory bandwidth can hinder improvements in hardware accelerators. Against this backdrop, in this paper we propose a novel memory arrangement strategy, governed by the hardware accelerator’s kernel size, which effectively minimizes off-chip data access. This arrangement is particularly beneficial for end-to-end transformer model inference, where most of the computation is based on general matrix multiplication (GEMM) operations. Additionally, we address the overhead of non-GEMM operations in transformer models within the scope of this memory data arrangement. Our study explores the implementation and effectiveness of the proposed accelerator-driven data arrangement approach in both single- and multi-core systems. Our evaluation demonstrates that our approach can achieve up to a 2.8x speed increase when executing inferences employing state-of-the-art transformers.
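
The abstract does not spell out the arrangement itself, so the following is only a minimal illustrative sketch of the general idea of a kernel-size-driven layout, not the authors' method. It assumes a hypothetical accelerator that consumes k x k tiles, and reorders a row-major matrix so that each tile sits contiguously in memory; a tiled GEMM can then fetch every operand tile as one contiguous run. The names `to_tile_major`, `tile`, and `tiled_matmul` are invented for this example.

```python
def to_tile_major(matrix, k):
    """Flatten a row-major 2D list so that each k x k tile is contiguous."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % k or cols % k:
        raise ValueError("dimensions must be multiples of the kernel size k")
    flat = []
    for tr in range(0, rows, k):          # tile row origin
        for tc in range(0, cols, k):      # tile column origin
            for r in range(tr, tr + k):   # rows inside the tile
                flat.extend(matrix[r][tc:tc + k])
    return flat


def tile(flat, cols, k, ti, tj):
    """Read tile (ti, tj) from a tile-major buffer as a k x k list."""
    tiles_per_row = cols // k
    start = (ti * tiles_per_row + tj) * k * k   # one contiguous run per tile
    block = flat[start:start + k * k]
    return [block[r * k:(r + 1) * k] for r in range(k)]


def tiled_matmul(a_flat, b_flat, m, n, p, k):
    """Compute C = A @ B for tile-major A (m x n) and B (n x p).

    Returns C as a row-major 2D list. The innermost k x k product stands in
    for the (hypothetical) accelerator kernel.
    """
    c = [[0] * p for _ in range(m)]
    for ti in range(m // k):
        for tj in range(p // k):
            for tk in range(n // k):
                a = tile(a_flat, n, k, ti, tk)
                b = tile(b_flat, p, k, tk, tj)
                for i in range(k):
                    for j in range(k):
                        c[ti * k + i][tj * k + j] += sum(
                            a[i][x] * b[x][j] for x in range(k)
                        )
    return c
```

In this sketch each tile read touches a single contiguous range of k*k elements instead of k separate row segments, which is the kind of access pattern a layout matched to the kernel size is meant to enable. How the paper handles non-GEMM operations and multi-core partitioning is not described in the abstract and is not modeled here.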

arXiv information

Authors	Alireza Amirshahi, Giovanni Ansaloni, David Atienza
Published	2023-12-20 13:01:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
