Video Understanding with Large Language Models: A Survey

Summary

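With the rapid growth of online video platforms and the rising volume of video content, demand for capable video understanding tools has increased markedly.
Building on Large Language Models (LLMs), which show remarkable capabilities on key language tasks, this survey gives a detailed overview of recent advances in video understanding that harness LLMs (Vid-LLMs).
The emergent capabilities of Vid-LLMs are surprisingly advanced; in particular, their capacity for open-ended spatial-temporal reasoning combined with commonsense knowledge suggests a promising path for future video understanding.
The survey examines the distinctive characteristics and capabilities of Vid-LLMs and groups the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods.
It also presents a comprehensive study of the tasks and datasets for Vid-LLMs, along with the methodologies used for evaluation.
In addition, it explores the broad applications of Vid-LLMs across domains, showing their scalability and versatility in addressing real-world video understanding challenges.
Finally, it summarizes the limitations of existing Vid-LLMs and directions for future research.
For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.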

Abstract (original)

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. With Large Language Models (LLMs) showcasing remarkable capabilities in key language tasks, this survey provides a detailed overview of the recent advancements in video understanding harnessing the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended spatial-temporal reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. Furthermore, this survey also presents a comprehensive study of the tasks and datasets for Vid-LLMs, along with the methodologies employed for evaluation. Additionally, the survey explores the expansive applications of Vid-LLMs across various domains, thereby showcasing their remarkable scalability and versatility in addressing challenges in real-world video understanding. Finally, the survey summarizes the limitations of existing Vid-LLMs and the directions for future research. For more information, we recommend readers visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

arXiv information

Authors	Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
Published	2023-12-29 01:56:17+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google