Q-VLM: Post-training Quantization for Large Vision-Language Models

Summary

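This paper proposes a post-training quantization framework for large vision-language models (LVLMs) that enables efficient multi-modal inference.
Conventional quantization methods search the layer-wise rounding functions sequentially by minimizing activation discretization errors; because they ignore cross-layer dependency, they fail to find the optimal quantization strategy.
In contrast, we mine the cross-layer dependency that significantly influences the discretization error of the entire vision-language model, and embed this dependency into the search for the optimal quantization strategy at low search cost.
Specifically, we observe a strong correlation between activation entropy and the cross-layer dependency with respect to output discretization errors.
We therefore use entropy as a proxy for partitioning the model into blocks, aiming for a satisfactory trade-off between discretization error and search cost.
In addition, we optimize the visual encoder to disentangle the cross-layer dependency, which allows a fine-grained decomposition of the search space and further reduces the search cost without harming quantization accuracy.
Experimental results show that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model, without performance degradation on diverse multi-modal reasoning tasks.
Code is available at https://github.com/ChangyuanWang17/QVLM.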
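
The idea of using activation entropy as a proxy for block partitioning can be illustrated with a minimal sketch. This is not the authors' implementation: the histogram-based entropy estimate, the function names `activation_entropy` and `partition_by_entropy`, the fixed `threshold`, and the rule "a high-entropy layer is grouped with the layer that follows it" are all assumptions made for illustration only.

```python
import math
from collections import Counter


def activation_entropy(values, num_bins=16):
    """Shannon entropy (in bits) of a histogram of activation values.

    Assumption: a simple equal-width histogram is used as the estimator.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all activations identical: no uncertainty
    width = (hi - lo) / num_bins
    # Map each value to a bin index; the maximum value goes into the last bin.
    counts = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def partition_by_entropy(entropies, threshold):
    """Group consecutive layers into blocks using per-layer entropy.

    Hypothetical rule: a layer whose entropy is at or above `threshold`
    is assumed to depend strongly on the next layer, so the block stays
    open; a low-entropy layer closes the current block.
    """
    blocks, current = [], []
    for layer_index, entropy in enumerate(entropies):
        current.append(layer_index)
        if entropy < threshold:
            blocks.append(current)
            current = []
    if current:  # close a trailing block that ended on a high-entropy layer
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    per_layer_entropy = [3.2, 0.5, 2.9, 3.1, 0.4]  # made-up values
    print(partition_by_entropy(per_layer_entropy, threshold=2.0))
```

Under these assumptions, layers inside one block would be searched jointly, while separate blocks would be searched independently. Larger blocks capture more dependency but enlarge the search space, which is the trade-off between discretization error and search cost described above.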

Abstract (original)

In this paper, we propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to acquire optimal quantization strategy without considering cross-layer dependency. On the contrary, we mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy searching with low search cost. Specifically, we observe the strong correlation between the activation entropy and the cross-layer dependency concerning output discretization errors. Therefore, we employ the entropy as the proxy to partition blocks optimally, which aims to achieve satisfying trade-offs between discretization errors and the search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of search space, so that the search cost is further reduced without harming the quantization accuracy. Experimental results demonstrate that our method compresses the memory by 2.78x and increase generate speed by 1.44x about 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.

arXiv information

Authors	Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Published	2024-11-15 13:57:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
