LongVILA: Scaling Long-Context Visual Language Models for Long Videos

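Summary

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and the system. For model training, we upgrade existing VLMs to support long video understanding by adding two stages: long context extension and long video supervised fine-tuning. Training on long videos, however, is expensive in both computation and memory. We therefore introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system, which efficiently parallelizes long video training and inference and enables training at 2M context length on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames VILA can handle from 8 to 2048, raises the long video captioning score from 2.00 to 3.26 (out of 5), and reaches 99.8% accuracy on video needle-in-a-haystack with 6,000 frames (more than 1 million tokens). LongVILA-7B achieves strong accuracy on the VideoMME benchmark, namely 61.8% with subtitles. In addition, MM-SP is 2.1x to 5.7x faster than ring-style sequence parallelism and 1.1x to 1.4x faster than Megatron with hybrid context and tensor parallelism. It also integrates seamlessly with Hugging Face Transformers.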

Abstract (original)

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack. LongVILA-7B demonstrates strong accuracy on the VideoMME benchmark, i.e., 61.8% with subtitle. Besides, MM-SP is 2.1x – 5.7x faster than ring style sequence parallelism and 1.1x – 1.4x faster than Megatron with a hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
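To give a feel for the scale involved in sequence parallelism, the sketch below splits one long token sequence into contiguous, near-equal shards, one per GPU. It is only an illustration of the general idea of sharding along the sequence dimension, not the MM-SP algorithm itself. The helper name `shard_bounds` is hypothetical, and reading "2M" as exactly 2,000,000 tokens is an assumption made for the arithmetic.

```python
def shard_bounds(seq_len: int, world_size: int) -> list[tuple[int, int]]:
    """Split [0, seq_len) into world_size contiguous, near-equal ranges.

    Hypothetical helper for illustration; an even contiguous split is a
    simplifying assumption, not the sharding strategy described for MM-SP.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(seq_len, world_size)
    bounds = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token each.
        length = base + (1 if rank < extra else 0)
        bounds.append((start, start + length))
        start += length
    return bounds


# Assumption: "2M context" is taken as 2,000,000 tokens on 256 GPUs.
bounds = shard_bounds(2_000_000, 256)
shard_sizes = [end - start for start, end in bounds]
print(min(shard_sizes), max(shard_sizes))
```

Under these assumptions each GPU holds roughly 7,800 tokens of the sequence, which suggests why spreading the sequence over many devices keeps per-device activation memory manageable.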

arXiv information

Authors	Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
Published	2024-11-01 10:57:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
