Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Abstract

Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

arXiv information

Authors	Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu
Published	2025-04-03 10:53:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
