Efficient Inference for Large Language Model-based Generative Recommendation

Summary

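Generative recommendation based on large language models (LLMs) has achieved notable success, but deploying it in practice is costly, chiefly because autoregressive decoding causes excessive inference latency.
Speculative decoding (SD) has emerged as a promising way to accelerate LLM decoding losslessly.
Applying SD to generative recommendation, however, raises a distinct challenge: the recommendation list consists of the top-K items (that is, K distinct token sequences), which must be generated by beam search.
Verification in SD therefore becomes stricter, because at every decoding step the draft model must successfully draft all of the target LLM's top-K sequences.
To alleviate this, the authors consider 1) strengthening top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls.
To this end, they propose an alignment framework named AtSpeed, which provides the AtSpeed-S optimization objective for top-K alignment under strict top-K verification.
They also introduce a relaxed sampling verification strategy that allows high-probability drafted sequences outside the top-K to be accepted, which significantly reduces LLM calls.
Correspondingly, they propose AtSpeed-R for top-K alignment under this relaxed sampling verification.
Experiments on two real-world datasets show that AtSpeed significantly accelerates LLM-based generative recommendation, for example by nearly 2x under strict top-K verification and by up to 2.5x under relaxed sampling verification.
The code and datasets are to be released in the near future.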

Abstract (original)

Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly, particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification. The code and datasets will be released in the near future.
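
The strict top-K verification condition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of sequences as token tuples, and the toy log-probabilities are all assumptions made for illustration. The sketch only encodes the stated rule that a drafted step is accepted when every one of the target LLM's top-K sequences appears among the draft model's candidates.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a candidate sequence is a tuple of tokens.
Sequence = Tuple[str, ...]


def target_top_k(target_scores: Dict[Sequence, float], k: int) -> Set[Sequence]:
    """Return the K sequences with the highest target-LLM score."""
    ranked = sorted(target_scores, key=target_scores.get, reverse=True)
    return set(ranked[:k])


def strict_top_k_accept(
    drafted: Set[Sequence], target_scores: Dict[Sequence, float], k: int
) -> bool:
    """Accept a step only if ALL target top-K sequences were drafted."""
    return target_top_k(target_scores, k) <= drafted


def accepted_steps(
    drafted_per_step: List[Set[Sequence]],
    target_scores_per_step: List[Dict[Sequence, float]],
    k: int,
) -> int:
    """Count consecutive drafted steps that pass strict verification."""
    accepted = 0
    for drafted, scores in zip(drafted_per_step, target_scores_per_step):
        if not strict_top_k_accept(drafted, scores, k):
            break  # the first rejected step ends the accepted prefix
        accepted += 1
    return accepted


# Toy data (assumed values, not taken from the paper).
step1_scores = {("a",): -0.1, ("b",): -0.5, ("c",): -2.0}
step2_scores = {
    ("a", "x"): -0.3,
    ("a", "y"): -0.9,
    ("b", "x"): -0.7,
    ("b", "y"): -3.0,
}
drafts = [
    {("a",), ("b",), ("c",)},   # covers the target top-2 {a, b}
    {("a", "x"), ("a", "y")},   # misses ("b", "x"), so the step is rejected
]
n_accepted = accepted_steps(drafts, [step1_scores, step2_scores], k=2)
```

In this toy run only the first step is accepted, because the second draft misses one of the target's top-2 sequences. This shows why a single missing sequence forces an extra target LLM call under strict verification, which is the cost that better top-K alignment and relaxed verification aim to reduce.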

arXiv information

Authors	Xinyu Lin,Chaoqun Yang,Wenjie Wang,Yongqi Li,Cunxiao Du,Fuli Feng,See-Kiong Ng,Tat-Seng Chua
Published	2024-10-07 16:23:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
