ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning

Summary

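Recent lightweight image captioning models that use retrieved data focus mainly on text prompts.
However, prior work uses the retrieved text only as text prompts, while the visual information relies solely on the CLIP visual embedding.
As a result, the image descriptions contained in the prompt are not sufficiently reflected in the visual embedding space.
To address this, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning.
ViPCap uses the retrieved text together with image information as visual prompts, strengthening the model's ability to capture relevant visual information.
By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, the method uses sampling to explore randomly augmented distributions and effectively retrieves semantic features that contain image information.
These retrieved features are integrated into the image and designated as the visual prompt, improving performance on datasets such as COCO, Flickr30k, and NoCaps.
Experimental results show that ViPCap significantly outperforms prior lightweight captioning models in both efficiency and effectiveness, and indicate its potential as a plug-and-play solution.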
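
To make the sampling idea in the summary more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the uniform range for the random scale, the top-k selection by cosine similarity, and the additive fusion are all hypothetical choices. Plain Python lists stand in for CLIP embeddings.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_visual_prompt(image_feat, text_feats, num_samples=8,
                         max_sigma=0.1, top_k=2, seed=0):
    """Hypothetical sketch: build one randomized Gaussian per retrieved-text
    embedding, sample from it, keep the samples closest to the image
    feature, and fuse them into the image feature."""
    rng = random.Random(seed)
    candidates = []
    for feat in text_feats:
        # Assumption: each text embedding gets its own randomly drawn scale.
        sigma = rng.uniform(0.01, max_sigma)
        for _ in range(num_samples):
            # One sample from an isotropic Gaussian centred on the embedding.
            candidates.append([rng.gauss(mu, sigma) for mu in feat])
    # Rank sampled features by similarity to the image feature.
    candidates.sort(key=lambda c: cosine(c, image_feat), reverse=True)
    selected = candidates[:top_k]
    # Assumption: fuse by adding the mean of the selected samples.
    mean = [sum(col) / len(selected) for col in zip(*selected)]
    return [i + m for i, m in zip(image_feat, mean)]


if __name__ == "__main__":
    image = [1.0, 0.0, 0.0]                      # stand-in image embedding
    texts = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]   # stand-in text embeddings
    print(sample_visual_prompt(image, texts))
```

In this toy setup the samples drawn around the first text embedding are far more similar to the image feature than those drawn around the second, so they dominate the fused vector. A real system would operate on CLIP features and a learned fusion module, which the summary does not specify.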

Summary (original)

Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works only utilize the retrieved text as text prompts, and the visual information relies only on the CLIP visual embedding. Because of this issue, there is a limitation that the image descriptions inherent in the prompt are not sufficiently reflected in the visual embedding space. To tackle this issue, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning. ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that contain image information. These retrieved features are integrated into the image and designated as the visual prompt, leading to performance improvements on the datasets such as COCO, Flickr30k, and NoCaps. Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness, demonstrating the potential for a plug-and-play solution.

arXiv information

Authors	Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim
Published	2024-12-30 05:07:17+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
