QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries

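Abstract

A domain shift exists between the large-scale internet data used to train a vision-language model (VLM) and the raw image streams collected by a robot.
Existing adaptation strategies require defining a closed set of classes, which is impractical for a robot that must respond to diverse natural language queries.
In response, we present QueryAdapter, a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query.
QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with the semantic classes related to the query.
By optimising learnable prompt tokens and actively selecting objects for training, it can produce an adapted model in a matter of minutes.
We also explore how objects unrelated to the query should be handled when real-world data is used for adaptation.
To that end, we propose using object captions as negative class labels, which helps produce better-calibrated confidence scores during adaptation.
Extensive experiments on ScanNet++ show that QueryAdapter significantly improves object retrieval performance compared with state-of-the-art unsupervised VLM adapters and 3D scene graph methods.
Furthermore, the approach generalizes robustly to abstract affordance queries and to other datasets such as Ego4D.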

Abstract (original)

A domain shift exists between the large-scale, internet data used to train a Vision-Language Model (VLM) and the raw image streams collected by a robot. Existing adaptation strategies require the definition of a closed-set of classes, which is impractical for a robot that must respond to diverse natural language queries. In response, we present QueryAdapter; a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query. QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. By optimising learnable prompt tokens and actively selecting objects for training, an adapted model can be produced in a matter of minutes. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation. In turn, we propose the use of object captions as negative class labels, helping to produce better calibrated confidence scores during adaptation. Extensive experiments on ScanNet++ demonstrate that QueryAdapter significantly enhances object retrieval performance compared to state-of-the-art unsupervised VLM adapters and 3D scene graph methods. Furthermore, the approach exhibits robust generalization to abstract affordance queries and other datasets, such as Ego4D.
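
The abstract's point about negative class labels can be illustrated with a toy calculation. The sketch below is not the paper's implementation: the similarity values, the temperature, the class names, and the helper names `softmax` and `query_confidence` are all assumptions made for illustration. It only shows why letting an object's own caption compete in the softmax can lower the confidence that an unrelated object receives for the query classes.

```python
import math


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over similarity scores (temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def query_confidence(query_sims, negative_sims=()):
    """Highest probability given to any query class.

    Negative classes (here, hypothetical caption similarities) compete in the
    same softmax but are excluded from the returned confidence.
    """
    probs = softmax(list(query_sims) + list(negative_sims))
    return max(probs[: len(query_sims)])


# Hypothetical cosine similarities for an object that is unrelated to the query.
unrelated_query_sims = [0.20, 0.18]  # e.g. similarity to "mug" and "kettle"
unrelated_caption_sim = [0.35]       # e.g. similarity to its caption, "a wooden bookshelf"

closed_set = query_confidence(unrelated_query_sims)
with_negative = query_confidence(unrelated_query_sims, unrelated_caption_sim)

print(f"closed set only:       {closed_set:.3f}")
print(f"with caption negative: {with_negative:.3f}")
```

With only the query classes available, the softmax must spread all probability mass across them, so even an unrelated object ends up with a fairly confident label. Adding the caption as a competing class gives that mass somewhere else to go.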

arXiv information

Authors	Nicolas Harvey Chapman,Feras Dayoub,Will Browne,Christopher Lehnert
Published	2025-02-26 01:07:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
