Efficient Few-Shot Continual Learning in Vision-Language Models

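Summary

Vision-language models (VLMs) excel at tasks such as visual question answering and image captioning.
However, VLMs are often limited by their reliance on pretrained image encoders such as CLIP, which leads to image understanding errors that hinder overall performance.
In addition, real-world applications often require the model to be adapted continually as new, and often limited, data keeps arriving.
To address this, the authors propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating the image encoder within a VLM.
LoRSU introduces structured, localized parameter updates that correct performance on previously error-prone data while preserving the model's general robustness.
The approach uses theoretical insights to identify and update only the most critical parameters, which yields significant resource efficiency.
Specifically, the authors show that LoRSU reduces computational overhead by more than 25x compared with full VLM updates, without sacrificing performance.
Experiments on VQA tasks in a few-shot continual learning setting validate LoRSU's scalability, efficiency, and effectiveness, making it a compelling option for image encoder adaptation in resource-constrained environments.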
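
The abstract does not spell out how parameters are selected or how the update is formed, so the sketch below only illustrates the general idea of combining a low-rank weight update with a restriction to a small subset of parameters. The gradient-norm selection rule, the row-wise mask, and every name in it (`top_k_row_mask`, `masked_low_rank_update`, `k`, `rank`) are assumptions made for illustration, not LoRSU's actual algorithm.

```python
import numpy as np

def top_k_row_mask(grad, k):
    """Boolean mask over rows: True for the k rows of `grad` with the largest L2 norm.
    (Assumed selection rule, chosen only for illustration.)"""
    row_norms = np.linalg.norm(grad, axis=1)
    keep = np.argsort(row_norms)[-k:]
    mask = np.zeros(grad.shape[0], dtype=bool)
    mask[keep] = True
    return mask

def masked_low_rank_update(W, A, B, mask):
    """Return W + B @ A, with the update zeroed on rows where mask is False."""
    delta = B @ A          # shape (out_dim, in_dim), rank at most A.shape[0]
    delta[~mask] = 0.0     # leave unselected rows of W untouched
    return W + delta

rng = np.random.default_rng(0)
out_dim, in_dim, rank, k = 8, 6, 2, 3

W = rng.normal(size=(out_dim, in_dim))      # stand-in for one encoder weight matrix
grad = rng.normal(size=(out_dim, in_dim))   # stand-in for a loss gradient w.r.t. W
A = rng.normal(size=(rank, in_dim))         # low-rank factors (would be trained)
B = rng.normal(size=(out_dim, rank))

mask = top_k_row_mask(grad, k)
W_new = masked_low_rank_update(W, A, B, mask)
```

In this toy setup only `k` of the `out_dim` rows change, and the change itself has rank at most `rank`, which is the sense in which such an update is both localized and low-rank.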

Summary (original)

Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. On top of that, real-world applications often require the model to be continuously adapted as new and often limited data continuously arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model’s general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting, validate LoRSU’s scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.

arXiv information

Authors	Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E. Turner
Published	2025-02-07 13:35:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
