Inference Optimization of Foundation Models on AI Accelerators

Summary

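Powerful foundation models, including large language models (LLMs) built on the Transformer architecture, have ushered in a new era of generative AI across various industries.
Industry and the research community have seen a large number of new applications built on these foundation models.
Such applications include question answering, customer service, image and video generation, and code completion.
However, as the number of model parameters reaches hundreds of billions, deploying these models in real-world scenarios incurs prohibitive inference costs and high latency.
As a result, demand for cost-effective, fast inference using AI accelerators is higher than ever.
To this end, our tutorial offers a comprehensive discussion of complementary inference optimization techniques using AI accelerators.
Beginning with an overview of the basic Transformer architecture and deep learning system frameworks, we dive deep into system optimization techniques for fast and memory-efficient attention computation and discuss how they can be implemented efficiently on AI accelerators.
Next, we describe architectural elements that are key to fast Transformer inference.
Finally, we examine various model compression and fast decoding strategies in the same context.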

Summary (original)

Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and research community have witnessed a large number of new applications, based on those foundation models. Such applications include question and answer, customer services, image and video generation, and code completions, among others. However, as the number of model parameters reaches to hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever more higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we deep dive into system optimization techniques for fast and memory-efficient attention computations and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.

arXiv information

Authors	Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis
Published	2024-10-01 17:10:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

