PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

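Summary

Vision-language models (VLMs) such as Flamingo and GPT-4V have shown immense potential by integrating large language models with vision systems.
Nevertheless, these models struggle with object localisation, a fundamental computer vision task, because they are trained on multimodal data that consists mostly of captions without explicit spatial grounding.
Custom supervised training pipelines that use bounding box annotations can be built and integrated with VLMs, but the resulting models are specialised and hard to scale.
In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler way: i) keeping the weights of the caption-based VLM frozen, and ii) using no supervised detection data.
To this end, we introduce the input-agnostic Positional Insert (PIN), a learnable spatial prompt containing a minimal set of parameters that is slid inside the frozen VLM to unlock object localisation capabilities.
Our PIN module is trained with a simple next-token prediction task on synthetic data, without introducing new output heads.
Our experiments demonstrate strong zero-shot localisation performance on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images such as paintings and cartoons.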

Summary (original)

Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.
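The abstract describes two ingredients: a small learnable insert that is the same for every input image, and localisation expressed as ordinary next-token prediction over text. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. It assumes the insert is simply added element-wise to the frozen visual features and that a box is written as four integers quantised into 100 bins. The abstract specifies neither detail, and the names `add_positional_insert` and `box_to_target_text` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # one row per visual token, one column per channel


def add_positional_insert(frozen_features: Matrix, pin: Matrix) -> Matrix:
    """Add an input-agnostic insert to frozen visual features.

    Assumption: the insert has the same shape as the feature map and is
    combined by element-wise addition. Only `pin` would be trained; the
    features come from a frozen encoder and are never updated.
    """
    if len(frozen_features) != len(pin):
        raise ValueError("PIN needs one row per visual token")
    combined = []
    for feature_row, pin_row in zip(frozen_features, pin):
        if len(feature_row) != len(pin_row):
            raise ValueError("PIN rows must match the feature dimension")
        combined.append([f + p for f, p in zip(feature_row, pin_row)])
    return combined


def box_to_target_text(
    box: Tuple[float, float, float, float],
    width: int,
    height: int,
    bins: int = 100,
) -> str:
    """Write a pixel-space box (x0, y0, x1, y1) as a text target.

    Assumption: coordinates are normalised by the image size and quantised
    into `bins` integer buckets, so the language model can emit them as
    ordinary tokens without a new output head.
    """
    x0, y0, x1, y1 = box

    def quantise(value: float, size: int) -> int:
        # Clamp so that a coordinate on the image border stays in range.
        return min(bins - 1, max(0, int(value * bins / size)))

    coords = [
        quantise(x0, width),
        quantise(y0, height),
        quantise(x1, width),
        quantise(y1, height),
    ]
    return "[" + ", ".join(str(c) for c in coords) + "]"
```

In a real training loop the string returned by `box_to_target_text` would be tokenised and used as the next-token prediction target, with gradients flowing only into the insert.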

arXiv information

Authors	Michael Dorkenwald,Nimrod Barazani,Cees G. M. Snoek,Yuki M. Asano
Published	2024-02-13 18:39:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
