M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

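Abstract

Multimodal Large Language Models (MLLMs) perform strongly across a wide range of domains, and there is growing emphasis on improving their zero-shot generalization to unseen tasks across modalities.
Instruction tuning, which finetunes pretrained models on diverse multimodal tasks, has emerged as an effective strategy for achieving zero-shot generalization.
As MLLMs continue to grow in scale, parameter-efficient finetuning becomes increasingly important.
However, most existing parameter-efficient approaches focus on a single modality and often overlook multimodal characteristics during finetuning.
This work introduces Multimodal Prompt Tuning (M$^2$PT), a new approach for efficient instruction tuning of MLLMs.
During finetuning, M$^2$PT integrates visual prompts into the vision encoder and textual prompts into the language processor, which facilitates the extraction and alignment of features across modalities.
Empirical results on various multimodal evaluation datasets show that the approach outperforms several state-of-the-art baselines.
A comprehensive set of ablation studies validates the effectiveness of the prompt design and the efficiency of the approach.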

Abstract (original)

Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
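
The abstract says only that visual and textual prompts are integrated into the vision encoder and the language processor. As a rough illustration of the general prompt-tuning idea, the sketch below prepends a small set of trainable prompt vectors to each modality's input sequence while the backbone inputs stay untouched. All names, sizes, and the initialization range here are hypothetical, and the prepend-at-input placement is an assumption; the abstract does not specify where M$^2$PT inserts its prompts or how the two prompt sets interact.

```python
import random


def init_prompts(num_prompts, dim, seed=0):
    """Create `num_prompts` prompt vectors of width `dim` (hypothetical init range)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_prompts)]


def prepend_prompts(prompts, tokens):
    """Return a new sequence: prompt vectors followed by the original embeddings."""
    if prompts and tokens and len(prompts[0]) != len(tokens[0]):
        raise ValueError("prompt and token widths must match")
    return [list(p) for p in prompts] + [list(t) for t in tokens]


def count_params(vectors):
    """Number of scalar values in a list of vectors."""
    return sum(len(v) for v in vectors)


# Toy sizes, chosen only for illustration.
VISION_DIM, TEXT_DIM = 8, 16
image_patches = [[0.0] * VISION_DIM for _ in range(4)]  # stand-in for patch embeddings
text_tokens = [[0.0] * TEXT_DIM for _ in range(6)]      # stand-in for token embeddings

visual_prompts = init_prompts(2, VISION_DIM)   # would accompany the vision encoder input
textual_prompts = init_prompts(3, TEXT_DIM)    # would accompany the language model input

vision_input = prepend_prompts(visual_prompts, image_patches)
text_input = prepend_prompts(textual_prompts, text_tokens)

# Only the prompt vectors would be updated during finetuning in this sketch.
trainable = count_params(visual_prompts) + count_params(textual_prompts)
print(len(vision_input), len(text_input), trainable)
```

In this toy setup the only trainable values are the prompt vectors, which is the sense in which prompt tuning is parameter-efficient: the count grows with the number and width of the prompts rather than with the size of the backbone.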

arXiv information

Authors	Taowen Wang, Yiyang Liu, James Chenhao Liang, Junhan Zhao, Yiming Cui, Yuning Mao, Shaoliang Nie, Jiahao Liu, Fuli Feng, Zenglin Xu, Cheng Han, Lifu Huang, Qifan Wang, Dongfang Liu
Published	2024-09-27 16:24:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
