Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Abstract

Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. All resources are publicly available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.

arXiv information

Authors	Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
Published	2025-06-02 17:15:08+00:00

Source, services used

arxiv.jp, Google
