Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Abstract

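Generative large multimodal models (LMMs) such as LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks.
Despite this strong performance, the generative outputs of LMMs are not specialized for vision-language classification tasks (that is, tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA.
One key challenge in using LMMs for these tasks is extracting useful features from generative LMMs.
To overcome this, the authors propose an approach that leverages multimodal feature extraction from the LMM's latent space.
To this end, they present Sparse Attention Vectors (SAVs), a finetuning-free method that uses sparse attention head activations in LMMs (fewer than 5% of the heads) as strong feature representations.
With only few-shot examples, SAVs demonstrate state-of-the-art performance compared with a variety of few-shot and finetuned baselines on a collection of vision-language classification tasks.
The experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.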

Abstract (original)

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs’ generative outputs are not specialized for vision-language classification tasks (i.e., tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM’s latent space. Toward this end, we present Sparse Attention Vectors (SAVs) — a finetuning-free method that leverages sparse attention head activations (fewer than 5% of the heads) in LMMs as strong feature representations. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of vision-language classification tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
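
The abstract does not say how the sparse heads are selected or how their activations are turned into predictions. As a rough illustration only, the sketch below assumes one plausible recipe: per-head activation vectors are already extracted (here they are synthetic), each head is scored by nearest-class-centroid accuracy on the few-shot examples, a small fraction of heads is kept, and queries are labeled by majority vote over the kept heads. The names `make_toy_data`, `select_sparse_heads`, and `classify` are hypothetical, and nothing here should be read as the authors' implementation.

```python
import numpy as np


def make_toy_data(rng, n_per_class, n_heads=100, dim=16,
                  informative=(3, 17, 42, 88), signal=4.0):
    """Synthetic stand-in for per-head activations (assumption, not real LMM output).

    Returns feats of shape (2 * n_per_class, n_heads, dim) and binary labels.
    Only the heads listed in `informative` carry class information.
    """
    labels = np.repeat([0, 1], n_per_class)
    feats = rng.normal(size=(2 * n_per_class, n_heads, dim))
    sign = np.where(labels == 0, -1.0, 1.0)
    # Shift coordinate 0 of the informative heads by a class-dependent offset.
    feats[:, list(informative), 0] += signal * sign[:, None]
    return feats, labels


def _nearest_centroid(feats, centroids):
    """Per-head index of the most cosine-similar class centroid.

    feats: (n, n_heads, dim); centroids: (n_classes, n_heads, dim).
    Returns an integer array of shape (n, n_heads).
    """
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    c = centroids / (np.linalg.norm(centroids, axis=-1, keepdims=True) + 1e-8)
    sims = np.einsum("nhd,chd->nhc", f, c)
    return sims.argmax(axis=-1)


def select_sparse_heads(feats, labels, num_heads_kept):
    """Score every head on the few-shot set and keep the best ones."""
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    preds = _nearest_centroid(feats, centroids)
    # In-sample accuracy of each head acting as a standalone classifier.
    head_acc = (classes[preds] == labels[:, None]).mean(axis=0)
    kept = np.argsort(-head_acc, kind="stable")[:num_heads_kept]
    return kept, classes, centroids


def classify(query_feats, kept, classes, centroids):
    """Majority vote over the kept heads' nearest-centroid predictions."""
    preds = _nearest_centroid(query_feats, centroids)[:, kept]
    votes = (preds[:, :, None] == np.arange(len(classes))).sum(axis=1)
    return classes[votes.argmax(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_x, train_y = make_toy_data(rng, n_per_class=20)
    test_x, test_y = make_toy_data(rng, n_per_class=50)
    kept, classes, centroids = select_sparse_heads(train_x, train_y, num_heads_kept=3)
    accuracy = (classify(test_x, kept, classes, centroids) == test_y).mean()
    print("kept heads:", kept, "accuracy:", accuracy)
```

Keeping 3 of 100 heads mirrors the "fewer than 5% of the heads" figure in the abstract; the scoring rule, the cosine similarity, and the voting scheme are illustrative choices. With a real LMM, `feats` would come from attention head outputs for each input, which this sketch leaves out.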

arXiv information

Authors	Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig
Published	2025-06-09 17:01:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
