Find Everything: A General Vision Language Model Approach to Multi-Object Search

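Abstract

The Multi-Object Search (MOS) problem involves navigating to a sequence of locations so as to maximize the likelihood of finding the target objects while minimizing travel cost. This paper introduces Finder, a new approach to the MOS problem that uses vision language models (VLMs) to find multiple objects in diverse environments. Specifically, the approach introduces multi-channel score maps for tracking and reasoning about multiple objects simultaneously during navigation, together with a score map technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings show that Finder outperforms existing methods based on deep reinforcement learning and VLMs. Ablation and scalability studies further validate, respectively, the design choices and the robustness of the approach as the number of target objects increases. Website: https://find-all-my-things.github.io/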

Abstract (original)

The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score map technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings showed that Finder outperforms existing methods using deep reinforcement learning and VLMs. Ablation and scalability studies further validated our design choices and robustness with increasing numbers of target objects, respectively. Website: https://find-all-my-things.github.io/
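
To make the idea of a multi-channel score map more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small 2D grid, one score channel per target object, a simple linear blend of a scene-level and an object-level score, and a greedy waypoint choice that trades summed score against straight-line travel distance. The class `MultiChannelScoreMap`, the function `combine_scores`, and the parameters `weight` and `travel_weight` are all hypothetical names introduced for illustration; in the paper the scores would come from a VLM, whereas here they are hand-written numbers.

```python
from math import hypot


def combine_scores(scene_score, object_score, weight=0.5):
    """Blend a scene-level and an object-level score (both assumed in [0, 1]).

    The linear blend and `weight` are illustrative assumptions only.
    """
    return weight * scene_score + (1.0 - weight) * object_score


class MultiChannelScoreMap:
    """Hypothetical container holding one 2D score grid (channel) per target."""

    def __init__(self, height, width, targets):
        self.height = height
        self.width = width
        self.channels = {
            t: [[0.0] * width for _ in range(height)] for t in targets
        }

    def update(self, target, row, col, scene_score, object_score):
        # Keep the highest combined score observed so far for this cell.
        new = combine_scores(scene_score, object_score)
        grid = self.channels[target]
        grid[row][col] = max(grid[row][col], new)

    def mark_found(self, target):
        # A found object no longer influences where to go next.
        self.channels.pop(target, None)

    def best_waypoint(self, robot_pos, travel_weight=0.05):
        """Return the cell maximizing summed score minus a distance penalty."""
        best_cell, best_value = None, float("-inf")
        for r in range(self.height):
            for c in range(self.width):
                score = sum(ch[r][c] for ch in self.channels.values())
                cost = travel_weight * hypot(r - robot_pos[0], c - robot_pos[1])
                if score - cost > best_value:
                    best_cell, best_value = (r, c), score - cost
        return best_cell


# Hand-written scores standing in for VLM outputs (assumed values).
score_map = MultiChannelScoreMap(4, 4, ["mug", "laptop"])
score_map.update("mug", 0, 3, scene_score=0.9, object_score=0.7)
score_map.update("laptop", 3, 3, scene_score=0.6, object_score=0.8)

first = score_map.best_waypoint(robot_pos=(0, 0))
score_map.mark_found("mug")
second = score_map.best_waypoint(robot_pos=first)
print(first, second)
```

Keeping a separate channel per target means that marking one object as found simply drops its channel, so the remaining targets continue to guide navigation without recomputing the whole map. The greedy selection rule is only one possible way to balance likelihood against travel cost.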

arXiv information

Authors	Daniel Choi,Angus Fung,Haitong Wang,Aaron Hao Tan
Published	2025-03-02 00:07:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
