Video Understanding with Large Language Models: A Survey

Abstract

With the rapid growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of Large Language Models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended spatial-temporal reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. The survey also presents a comprehensive study of the tasks, datasets, and evaluation methodologies for Vid-LLMs, and explores their applications across various domains, highlighting their scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, see the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

arXiv information

Authors	Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
Published	2024-01-04 03:08:53+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
