Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Summary

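Search engines make it possible to look up unknown information through text queries.
However, conventional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object the model has never seen before.
This challenge is especially pronounced in large vision-language models (VLMs): when a model has not been exposed to the object depicted in an image, it struggles to produce reliable answers to a user's questions about that image.
Moreover, because new objects and events emerge continuously, updating VLMs frequently is impractical given the heavy computational cost.
To address this limitation, we propose Vision Search Assistant, a new framework that facilitates collaboration between VLMs and web agents.
The approach uses the VLM's visual understanding and the web agent's real-time access to information to perform open-world retrieval-augmented generation over the web.
By integrating visual and textual representations through this collaboration, the model can give informed responses even when the image is new to the system.
Extensive experiments on both open-set and closed-set QA benchmarks show that Vision Search Assistant significantly outperforms other models and can be widely applied to existing VLMs.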
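
The collaboration described above can be illustrated with a generic retrieve-then-answer loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`answer_with_web_search`, `describe_image`, `search_web`, `generate_answer`), the `WebSnippet` type, and the `max_snippets` parameter are all hypothetical, and the stubs stand in for a real VLM and a real web agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WebSnippet:
    """One piece of text returned by the (hypothetical) web agent."""
    url: str
    text: str


def answer_with_web_search(
    image: bytes,
    question: str,
    describe_image: Callable[[bytes, str], str],
    search_web: Callable[[str], List[WebSnippet]],
    generate_answer: Callable[[bytes, str, str], str],
    max_snippets: int = 3,
) -> str:
    """Retrieve-then-answer loop; all three callables are placeholders."""
    # 1. The VLM turns the visual content into a text query.
    query = describe_image(image, question)
    # 2. The web agent fetches text for that query; keep only the first few hits.
    snippets = search_web(query)[:max_snippets]
    # 3. The retrieved text is joined into a context string and passed back
    #    to the VLM together with the image and the question.
    context = "\n".join(f"[{s.url}] {s.text}" for s in snippets)
    return generate_answer(image, question, context)


# Stubs so the sketch runs without any model or network access (assumptions).
def stub_describe(image: bytes, question: str) -> str:
    return "unfamiliar red gadget"


def stub_search(query: str) -> List[WebSnippet]:
    return [
        WebSnippet(f"https://example.com/{i}", f"page {i} about {query}")
        for i in range(3)
    ]


def stub_generate(image: bytes, question: str, context: str) -> str:
    return f"{question} | {context}"
```

Passing the three stages in as callables keeps the sketch independent of any particular VLM or search backend; a real system would replace the stubs with model and web-agent calls.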

Summary (original)

Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user’s question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs’ visual understanding capabilities and web agents’ real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that the Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.

arXiv information

Authors	Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue
Published	2024-10-28 17:04:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
