RCA: Region Conditioned Adaptation for Visual Abductive Reasoning

Abstract

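Visual abductive reasoning aims to produce likely explanations for visual observations.
We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips a frozen CLIP with the ability to infer explanations from local visual cues.
We encode "local hints" and "global contexts" into the visual prompts of the CLIP model separately, at fine-grained and coarse-grained levels.
Adapters are used to fine-tune CLIP models for downstream tasks, and we design a new attention adapter that directly steers the focus of the attention map with trainable query and key projections of a frozen CLIP model.
Finally, we train the new model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of literal descriptions and plausible explanations.
This loss enables CLIP to maintain both perception and reasoning abilities.
Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous SOTA methods, ranking 1st on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58; higher is better).
We also validate that RCA generalizes to local perception benchmarks such as RefCOCO.
The project is open-sourced at https://github.com/LUNAProject22/RPA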
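
The abstract does not spell out the form of the modified contrastive loss, so the following is only a minimal sketch of one way to pull a visual feature toward two text targets at once. It assumes a one-directional (image-to-text) InfoNCE term per target and a weighted sum of the two terms. The function names, the `weight` parameter, and the temperature value of 0.07 are illustrative assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(visual, text, temperature=0.07):
    """Mean cross-entropy of matching visual[i] to text[i] within a batch."""
    visual = [normalize(v) for v in visual]
    text = [normalize(t) for t in text]
    total = 0.0
    for i, v in enumerate(visual):
        logits = [dot(v, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(visual)


def dual_contrastive_loss(visual, literal_text, inference_text,
                          weight=0.5, temperature=0.07):
    # Assumed combination: a weighted sum of two contrastive terms, one toward
    # literal descriptions and one toward plausible explanations.
    loss_literal = info_nce(visual, literal_text, temperature)
    loss_inference = info_nce(visual, inference_text, temperature)
    return weight * loss_literal + (1.0 - weight) * loss_inference


if __name__ == "__main__":
    # Toy 2-D features standing in for CLIP embeddings (hypothetical values).
    visual = [[1.0, 0.0], [0.0, 1.0]]
    literal = [[1.0, 0.1], [0.1, 1.0]]
    inference = [[0.9, 0.2], [0.2, 0.9]]
    print(dual_contrastive_loss(visual, literal, inference))
```

In this sketch, a batch whose visual features align with both their literal descriptions and their explanations yields a lower loss than a batch in which either pairing is shuffled, which is the behavior the abstract attributes to maintaining both perception and reasoning abilities.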

Abstract (original)

Visual abductive reasoning aims to make likely explanations for visual observations. We propose a simple yet effective Region Conditioned Adaptation, a hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer explanations from local visual cues. We encode “local hints” and “global contexts” into visual prompts of the CLIP model separately at fine and coarse-grained levels. Adapters are used for fine-tuning CLIP models for downstream tasks and we design a new attention adapter, that directly steers the focus of the attention map with trainable query and key projections of a frozen CLIP model. Finally, we train our new model with a modified contrastive loss to regress the visual feature simultaneously toward features of literal description and plausible explanations. The loss enables CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that the RCA significantly outstands previous SOTAs, ranking the 1st on the leaderboards (e.g., Human Acc: RCA 31.74 vs CPT-CLIP 29.58, higher = better). We also validate the RCA is generalizable to local perception benchmarks like RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA

arXiv information

Authors	Hao Zhang, Yeo Keat Ee, Basura Fernando
Published	2024-08-07 13:44:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
