Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

Abstract

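Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across a wide range of multimodal tasks.
However, deploying VLMs requires substantial energy and computational resources.
If an attacker maliciously induces high energy consumption and latency (the energy-latency cost) during VLM inference, those computational resources can be exhausted.
This paper explores this attack surface against the availability of VLMs and aims to induce a high energy-latency cost during inference.
We find that the energy-latency cost of VLM inference can be driven up by maximizing the length of the generated sequence.
To this end, we propose verbose images: images carrying an imperceptible perturbation crafted to make a VLM generate long sentences during inference.
Concretely, we design three loss objectives.
First, a loss is proposed to delay the occurrence of the end-of-sequence (EOS) token, which is the signal for a VLM to stop generating further tokens.
In addition, an uncertainty loss and a token diversity loss are proposed to increase, respectively, the uncertainty over each generated token and the diversity among all tokens of the generated sequence, which breaks output dependency at the token level and the sequence level.
Furthermore, a temporal weight adjustment algorithm is proposed to balance these losses effectively.
Extensive experiments demonstrate that verbose images can increase the length of generated sequences by 7.87 times and 8.56 times relative to the original images on the MS-COCO and ImageNet datasets, respectively, which poses potential challenges for various applications.
Our code is available at https://github.com/KuofengGao/Verbose_Images.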
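
The three loss objectives described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the formulas, so the function names (`eos_loss`, `uncertainty_loss`, `diversity_loss`, `total_loss`), the fixed weights, and the toy inputs are all assumptions made here. In particular, mean pairwise cosine similarity is a stand-in chosen for illustration of "token diversity", and the temporal weight adjustment algorithm is not reproduced. A real attack would compute these quantities on differentiable model outputs and optimize the image perturbation with gradients; the plain-Python version below only shows what each term measures.

```python
import math


def eos_loss(step_probs, eos_id):
    """Mean probability given to the EOS token over all decoding steps.

    Minimizing it makes the stop signal less likely at every step.
    """
    return sum(p[eos_id] for p in step_probs) / len(step_probs)


def uncertainty_loss(step_probs):
    """Negative mean Shannon entropy of the per-step distributions.

    Minimizing it raises the uncertainty over each generated token.
    """
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return -sum(entropy(p) for p in step_probs) / len(step_probs)


def diversity_loss(token_vectors):
    """Mean pairwise cosine similarity between token feature vectors.

    Assumed stand-in for a diversity term: minimizing it pushes the
    tokens of the whole sequence apart from one another.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0.0 else 0.0

    n = len(token_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(token_vectors[i], token_vectors[j]) for i, j in pairs) / len(pairs)


def total_loss(step_probs, token_vectors, eos_id, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; fixed weights are an assumption."""
    w_eos, w_unc, w_div = weights
    return (w_eos * eos_loss(step_probs, eos_id)
            + w_unc * uncertainty_loss(step_probs)
            + w_div * diversity_loss(token_vectors))


# Toy example: two decoding steps over a 3-token vocabulary, EOS at index 2.
probs = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
vectors = [[1.0, 0.0], [0.0, 1.0]]
print(eos_loss(probs, eos_id=2))  # 0.375
print(total_loss(probs, vectors, eos_id=2))
```

Lowering all three terms at once corresponds to the stated goal: a later EOS token, less predictable individual tokens, and a less repetitive sequence overall.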

Abstract (original)

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs, it will exhaust computational resources. In this paper, we explore this attack surface about availability of VLMs and aim to induce high energy-latency cost during inference of VLMs. We find that high energy-latency cost during inference of VLMs can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at token-level and sequence-level. Furthermore, a temporal weight adjustment algorithm is proposed, which can effectively balance these losses. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.

arXiv information

Authors	Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
Published	2024-03-22 15:31:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
