Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

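Summary

The success of large language models (LLMs) has been accompanied by the development of large multimodal models (LMMs) such as Gemini-pro, which have begun to transform a variety of applications.
These models are designed to interpret and analyze complex data, integrating textual and visual information at a scale that was previously unattainable.
This paper investigates the applicability and effectiveness of a prompt-engineered Gemini-pro LMM versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges.
The authors focus on two distinct tasks: a visually evident task, detecting simple triggers such as small squares in images that indicate potential backdoors, and a non-visually evident task, classifying malware through visual representations.
The results show a significant gap in performance, with Gemini-pro falling short of the fine-tuned ViT models in both accuracy and reliability.
The ViT models, by contrast, achieve near-perfect performance on both tasks.
The study shows both the strengths and the limitations of prompt-engineered LMMs in cybersecurity applications, and it highlights the effectiveness of fine-tuned ViT models for tasks that demand precision and reliability.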

Abstract (original)

The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), such as Gemini-pro, which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data, integrating both textual and visual information on a scale previously unattainable, opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct tasks: a visually evident task of detecting simple triggers, such as small squares in images, indicative of potential backdoors, and a non-visually evident task of malware classification through visual representations. Our results highlight a significant divergence in performance, with Gemini-pro falling short in accuracy and reliability when compared to fine-tuned ViT models. The ViT models, on the other hand, demonstrate exceptional accuracy, achieving near-perfect performance on both tasks. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.
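To make the two kinds of input concrete, here is a minimal Python sketch of how such images might be produced. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the image width, the trigger size or position, or any preprocessing details, so the function names `bytes_to_grayscale` and `stamp_square_trigger` and all default parameters are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values in 0..255


def bytes_to_grayscale(data: bytes, width: int = 16) -> Image:
    """Lay out raw bytes row by row as a grayscale image.

    Assumption: one byte maps to one pixel, and the last row is
    zero-padded so that every row has exactly `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded = data + bytes(-len(data) % width)  # pad up to a multiple of width
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def stamp_square_trigger(image: Image, size: int = 3, value: int = 255) -> Image:
    """Return a copy of `image` with a size x size square in the bottom-right corner.

    Assumption: the corner position and the pixel value are arbitrary
    choices made for this illustration.
    """
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("trigger does not fit inside the image")
    stamped = [row[:] for row in image]  # copy so the input is left unchanged
    for r in range(height - size, height):
        for c in range(width - size, width):
            stamped[r][c] = value
    return stamped


# Usage: turn 20 bytes into a 3 x 8 image, then stamp a 2 x 2 trigger on it.
sample = bytes_to_grayscale(bytes(range(20)), width=8)
triggered = stamp_square_trigger(sample, size=2)
```

A square like this is obvious to a human viewer, whereas the byte-derived image carries no pattern a person would readily recognize, which is the distinction the abstract draws between the visually evident and non-visually evident tasks.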

arXiv information

Authors	Fouad Trad, Ali Chehab
Published	2024-03-26 15:20:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
