Vision-Language Models for Vision Tasks: A Survey

Abstract

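Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, which makes the visual recognition paradigm laborious and time-consuming.
To address these two challenges, vision-language models (VLMs) have recently been studied intensively. A VLM learns rich vision-language correlations from web-scale image-text pairs, which are almost infinitely available on the Internet, and enables zero-shot prediction on various visual recognition tasks with a single model.
This paper provides a systematic review of vision-language models for various visual recognition tasks, covering:
(1) the background that introduces the development of visual recognition paradigms;
(2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks;
(3) the datasets widely adopted in VLM pre-training and evaluation;
(4) a review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods;
(5) benchmarking, analysis, and discussion of the reviewed methods;
(6) several research challenges and potential research directions that future VLM studies on visual recognition could pursue.
A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.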

Abstract (original)

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learns rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enables zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.

arXiv information

Authors	Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu
Published	2024-02-16 10:28:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
