Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

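Summary

Large language models have shown strong general-purpose capabilities across a wide range of open-ended tasks, and their use has expanded to multimodal conversation.
Existing methods, however, struggle to handle image and video understanding together, particularly when the number of visual tokens is limited.
This work introduces Chat-UniVi, a unified vision-language model that can understand and take part in conversations involving images and videos through a unified visual representation.
Specifically, it uses a set of dynamic visual tokens to represent images and videos uniformly.
This representation framework lets the model use a limited number of visual tokens efficiently, capturing both the spatial detail that images require and the comprehensive temporal relationships that videos require.
It also uses a multi-scale representation so that the model perceives both high-level semantic concepts and low-level visual details.
Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, so it can be applied directly to tasks involving either medium without modification.
Extensive experimental results show that Chat-UniVi, as a unified model, consistently outperforms even existing methods designed exclusively for images or for videos.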
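
The abstract does not say how the dynamic visual tokens are built, so the following is only a minimal sketch of the general idea, under stated assumptions: similar tokens are greedily averaged until a fixed budget remains, and several budgets are concatenated to form a multi-scale set. The greedy cosine-similarity merging, the function names `merge_tokens` and `multi_scale`, and the toy 2-dimensional tokens are all hypothetical stand-ins, not the method used in Chat-UniVi.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, budget):
    """Greedily merge the most similar pair of tokens until `budget` remain.

    Assumption: this greedy scheme is an illustrative stand-in, not the
    paper's algorithm. Each merged token is the mean of its members.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    # Keep (vector sum, member count) so the mean can be recovered at the end.
    clusters = [(list(t), 1) for t in tokens]
    while len(clusters) > budget:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cosine is scale-invariant, so comparing sums equals comparing means.
                s = cosine(clusters[i][0], clusters[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        (si, ci), (sj, cj) = clusters[i], clusters[j]
        clusters[i] = ([x + y for x, y in zip(si, sj)], ci + cj)
        del clusters[j]  # j > i, so index i is unaffected
    return [[x / n for x in s] for s, n in clusters]


def multi_scale(tokens, budgets):
    """Concatenate token sets merged to each budget in turn.

    Each stage merges the previous stage's output, treating its tokens
    with equal weight.
    """
    out, current = [], tokens
    for b in budgets:
        current = merge_tokens(current, b)
        out.extend(current)
    return out


# An "image" is a list of patch tokens; a "video" is the same thing with
# the patch tokens of every frame flattened into one list.
image = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
video = [tok for frame in (image, image) for tok in frame]

image_tokens = multi_scale(image, [4, 2, 1])
video_tokens = multi_scale(video, [4, 2, 1])
```

In this sketch both inputs end up with the same number of tokens (4 + 2 + 1 = 7), which is the sense in which a fixed token budget can serve images and videos through one code path.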

Summary (original)

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a unified vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.

arXiv information

Authors	Peng Jin,Ryuichi Takanobu,Caiwan Zhang,Xiaochun Cao,Li Yuan
Published	2023-11-14 10:11:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
