Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

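Summary

Multimodal vision language models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities.
For example, models such as CLIP, Claude, and GPT-4V show strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification.
Despite rapid research progress and growing popularity in applications, a comprehensive survey of existing work on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains.
To this end, we provide a systematic overview of VLMs in the following aspects: model information for the major VLMs developed over the past five years (2019-2024);
the main architectures and training methods of these VLMs;
a summary and categorization of the popular benchmarks and evaluation metrics for VLMs;
the applications of VLMs, including embodied agents, robotics, and video generation;
and the challenges and issues faced by current VLMs, such as hallucination, fairness, and safety.
Detailed collections, including papers and model repository links, are listed at https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.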

Abstract (original)

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.

arXiv information

Authors	Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi
Published	2025-01-29 00:26:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
