HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

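Abstract

With the advancement of language models, unified multimodal understanding and generation has made significant progress, with model architectures evolving from separate components into unified single-model frameworks. This paper explores an efficient training paradigm for building a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy that uses prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed techniques, we present HaploOmni, a new single multimodal transformer. With limited training cost, HaploOmni achieves performance competitive with advanced unified models across multiple image and video understanding and generation benchmarks. All code will be made public at https://github.com/Tencent/HaploVLM.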

Abstract (original)

With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present the HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks over advanced unified models. All codes will be made public at https://github.com/Tencent/HaploVLM.

arXiv information

Authors	Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan
Published	2025-06-03 15:14:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
