LLaVA-OneVision: Easy Visual Task Transfer

Summary

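We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating the insights into data, models, and visual representations gained in the LLaVA-NeXT blog series.
Our experiments show that LLaVA-OneVision is the first single model to simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video.
Importantly, its design enables strong transfer learning across modalities and scenarios, which gives rise to new emerging capabilities.
In particular, task transfer from images to videos yields strong video understanding and cross-scenario capabilities.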

Summary (original)

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

arXiv information

Authors	Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Published	2024-08-06 17:59:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
