FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models

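Summary

Textual prompt tuning adapts vision-language models such as CLIP to federated learning by tuning lightweight input tokens (prompts) on each client's local data while keeping the network weights frozen.
After training, the clients share only the prompts with the central server for aggregation.
However, textual prompt tuning tends to overfit to known concepts and may rely too heavily on memorized text features, which limits its adaptability to unseen concepts.
To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP), which conditions the prompts on comprehensive multimodal contextual information: the image-conditioned features and the textual attribute features of a class.
At the core of FedMVP is the PromptFormer module, which aligns textual and visual features through cross-attention and so enables richer contextual integration.
The dynamically generated multimodal visual prompts are then fed into CLIP's frozen vision encoder and trained with a combination of a CLIP similarity loss and a consistency loss.
Extensive evaluation on 20 datasets spanning three generalization settings shows that FedMVP not only preserves performance on in-distribution classes and domains but also generalizes better to unseen classes and domains than state-of-the-art methods.
The code will be released upon acceptance.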
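
The prompt-sharing step can be illustrated with a small sketch. The abstract says only that prompts are sent to the server "for aggregation"; the sample-count-weighted average below (FedAvg-style) is an assumption, not a rule confirmed by the source, and the function name `aggregate_prompts` and the flattened-list prompt representation are hypothetical.

```python
from typing import List, Sequence


def aggregate_prompts(
    client_prompts: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average flattened client prompt vectors, weighted by local sample count.

    ASSUMPTION: FedAvg-style weighting; the abstract does not state which
    aggregation rule FedMVP actually uses.
    """
    if not client_prompts or len(client_prompts) != len(client_sizes):
        raise ValueError("need exactly one sample count per client prompt")
    dim = len(client_prompts[0])
    if any(len(prompt) != dim for prompt in client_prompts):
        raise ValueError("all client prompts must have the same length")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated = [0.0] * dim
    for prompt, size in zip(client_prompts, client_sizes):
        weight = size / total  # this client's share of all training samples
        for i, value in enumerate(prompt):
            aggregated[i] += weight * value
    return aggregated


if __name__ == "__main__":
    # Two clients; the second holds three times as much data as the first.
    print(aggregate_prompts([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only the small prompt vectors cross the network in this scheme; the frozen model weights never leave the clients, which is what keeps the communication cost low.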

Abstract (original)

Textual prompt tuning adapts Vision-Language Models (e.g., CLIP) in federated learning by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. Post training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning often struggles with overfitting to known concepts and may be overly reliant on memorized text features, limiting its adaptability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on comprehensive contextual information (image-conditioned features and textual attribute features of a class) that is multimodal in nature. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through cross-attention, enabling richer contextual integration. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets spanning three generalization settings demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains when compared to state-of-the-art methods. Code will be released upon acceptance.
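
As a rough illustration of the cross-attention alignment mentioned above, the sketch below implements plain single-head scaled dot-product cross-attention. It is not the PromptFormer itself: the abstract does not describe its layers, so the choice of text attribute features as queries and image features as keys and values, the absence of learned projections, and the names `softmax` and `cross_attention` are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector],  # ASSUMPTION: textual attribute features
    keys: List[Vector],     # ASSUMPTION: image features
    values: List[Vector],   # ASSUMPTION: image features (may equal keys)
) -> List[Vector]:
    """Return one attended vector per query: softmax(q.k / sqrt(d)) @ values."""
    if len(keys) != len(values) or not keys:
        raise ValueError("keys and values must be non-empty and equally long")
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return outputs


if __name__ == "__main__":
    text_feats = [[0.0, 0.0], [10.0, 0.0]]
    image_feats = [[1.0, 0.0], [0.0, 1.0]]
    image_values = [[2.0, 0.0], [0.0, 4.0]]
    print(cross_attention(text_feats, image_feats, image_values))
```

Each output row is a convex combination of the value vectors, so a query that matches one key strongly pulls its output toward that key's value, while a query with no preference returns the plain average.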

arXiv information

Authors	Mainak Singha, Subhankar Roy, Sarthak Mehrotra, Ankit Jha, Moloud Abdar, Biplab Banerjee, Elisa Ricci
Published	2025-04-29 15:36:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
