Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency

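Abstract

Deploying large language models (LLMs) on edge devices poses significant challenges because of computational constraints, memory limits, inference speed, and energy consumption. Model quantization has emerged as a key technique for efficient LLM inference because it reduces model size and computational overhead. This study presents a comprehensive analysis of 28 quantized LLMs from the Ollama library, which applies Post-Training Quantization (PTQ) and weight-only quantization by default, deployed on an edge device (a Raspberry Pi 4 with 4GB RAM). Energy efficiency, inference performance, and output accuracy are evaluated across multiple quantization levels and task types. The models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and a high-resolution, hardware-based energy measurement tool captures real-world power consumption. The results reveal the trade-offs among energy efficiency, inference speed, and accuracy across quantization settings and highlight configurations that optimize LLM deployment in resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, the study offers practical insights for sustainable AI and fills an important gap in existing research on energy-aware LLM deployment.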

Abstract (original)

Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, inference speed, and energy consumption. Model quantization has emerged as a key technique to enable efficient LLM inference by reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which applies by default Post-Training Quantization (PTQ) and weight-only quantization techniques, deployed on an edge device (Raspberry Pi 4 with 4GB RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI, bridging a critical gap in existing research on energy-aware LLM deployment.
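
The abstract names weight-only post-training quantization without describing it. As a rough illustration only, the Python sketch below applies symmetric round-to-nearest quantization to a weight list, storing one floating-point scale per block of weights. The function names, the 4-bit width, and the block size of 32 are assumptions chosen for the example; this is not the scheme used by the paper or by Ollama, whose actual quantization formats are more elaborate.

```python
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Quantize one block of weights to signed integers with a shared scale.

    Hypothetical helper for illustration: symmetric round-to-nearest.
    """
    qmax = 2 ** (bits - 1) - 1  # largest magnitude, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Map the largest weight magnitude onto qmax; avoid dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_block(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate weights from integers and the block scale."""
    return [q * scale for q in quantized]


def quantize_weights(
    weights: List[float], block_size: int = 32, bits: int = 4
) -> List[Tuple[List[int], float]]:
    """Split the weights into blocks and quantize each block on its own."""
    return [
        quantize_block(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]


def dequantize_weights(blocks: List[Tuple[List[int], float]]) -> List[float]:
    """Concatenate the dequantized blocks back into one weight list."""
    restored: List[float] = []
    for quantized, scale in blocks:
        restored.extend(dequantize_block(quantized, scale))
    return restored
```

Only the weights are quantized here, and no training data or gradient step is involved, which is what makes the approach both "weight-only" and "post-training". With round-to-nearest, each restored weight differs from the original by at most half of its block's scale, so smaller blocks or more bits reduce the error at the cost of a larger model.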

arXiv information

Authors	Erik Johannes Husom, Arda Goknil, Merve Astekin, Lwin Khin Shar, Andre Kåsen, Sagar Sen, Benedikt Andreas Mithassel, Ahmet Soylu
Published	2025-04-04 11:29:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
