Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

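Abstract

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities.
For example, models such as CLIP, Claude, and GPT-4V show strong reasoning and understanding abilities on visual and textual data, and outperform classical single-modality vision models on zero-shot classification.
Despite rapid progress in VLM research and growing popularity in applications, a comprehensive survey of existing work on VLMs is notably lacking, particularly for researchers who want to apply VLMs in their own domains.
To this end, we provide a systematic overview of VLMs from the following perspectives:
model information for the major VLMs developed over the past five years (2019-2024);
the main architectures and training methods of these VLMs;
a summary and categorization of the popular benchmarks and evaluation metrics for VLMs;
applications of VLMs, including embodied agents, robotics, and video generation;
the challenges and issues facing current VLMs, such as hallucination, fairness, and safety.
A detailed collection, including links to papers and model repositories, is listed at https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.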

Abstract (original)

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.

arXiv information

Authors	Zongxia Li,Xiyang Wu,Hongyang Du,Huy Nghiem,Guangyao Shi
Published	2025-01-04 04:59:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
