Energy Considerations of Large Language Model Inference and Efficiency Optimizations

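Summary

As large language models (LLMs) grow in scale and adoption, their computational and environmental costs continue to rise.
Earlier benchmarking efforts have focused mainly on reducing latency in idealized settings, and often overlook the diverse real-world inference workloads that shape energy use.
In this work, we systematically analyze how common inference efficiency optimizations affect energy use across diverse natural language processing (NLP) and generative artificial intelligence (AI) workloads, including conversational AI and code generation.
We introduce a modeling approach that approximates real-world LLM workflows by binning input-output token distributions and varying batch size.
Our empirical analysis covers software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations.
We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerator, and that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption.
Our findings show that properly applying the relevant inference efficiency optimizations can cut total energy use by up to 73% relative to unoptimized baselines.
These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.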
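The binning idea mentioned in the summary can be illustrated with a minimal sketch. The abstract does not give the bin edges or the aggregation rule, so the edges, the function names (`bin_index`, `bin_workload`, `estimate_energy`), and the per-bin energy figures below are all hypothetical, chosen only to show how requests might be grouped by input and output token counts and then weighted by per-bin measurements.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges in tokens; the source does not state the actual edges.
INPUT_EDGES = [128, 512, 2048]
OUTPUT_EDGES = [64, 256, 1024]


def bin_index(length, edges):
    """Return the index of the bin that `length` falls into.

    Bin 0 is [0, edges[0]), bin 1 is [edges[0], edges[1]), and so on;
    the last bin is open-ended.
    """
    return bisect_right(edges, length)


def bin_workload(requests):
    """Count requests per (input bin, output bin) cell.

    `requests` is an iterable of (input_tokens, output_tokens) pairs.
    """
    return Counter(
        (bin_index(n_in, INPUT_EDGES), bin_index(n_out, OUTPUT_EDGES))
        for n_in, n_out in requests
    )


def estimate_energy(counts, energy_per_request):
    """Weight a per-cell energy measurement by how often each cell occurs."""
    return sum(n * energy_per_request[cell] for cell, n in counts.items())
```

Under these assumptions, a workload trace reduces to a small table of cell counts, so energy only has to be measured once per cell rather than once per request.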

Summary (original)

As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.
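For orientation, the kind of naive FLOPs-based estimate the abstract warns against can be written as a short back-of-envelope calculation. This is an illustrative sketch, not the authors' method: it assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass, a fixed hardware utilization, and a constant power draw, and all function and parameter names are hypothetical.

```python
def naive_energy_joules(n_params, n_tokens, peak_flops, utilization, power_watts):
    """Back-of-envelope energy estimate from FLOPs alone.

    Assumes roughly 2 FLOPs per parameter per token (forward pass only),
    a fixed fraction of peak throughput, and constant power draw.
    """
    flops = 2.0 * n_params * n_tokens
    seconds = flops / (peak_flops * utilization)
    return power_watts * seconds


def relative_saving(baseline_joules, optimized_joules):
    """Fractional energy reduction of an optimized run versus a baseline."""
    return 1.0 - optimized_joules / baseline_joules
```

The estimate ignores memory traffic, batching effects, framework overhead, and idle power, which is consistent with the abstract's point that such figures fall well below measured consumption. `relative_saving` simply expresses a reduction such as the reported "up to 73%" as a fraction of the baseline.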

arXiv information

Authors	Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luccioni, Emma Strubell
Published	2025-04-24 15:45:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
