UniViTAR: Unified Vision Transformer with Native Resolution

Summary

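Conventional Vision Transformers simplify visual modeling by standardizing input resolution, which often ignores the variability of natural visual data and degrades spatial-contextual fidelity.
Preliminary work has explored native-resolution modeling only superficially, and existing approaches still lack a systematic analysis from the perspective of visual representation.
To bridge this gap, we introduce UniViTAR, a family of homogeneous vision foundation models tailored to unified visual modalities and native-resolution scenarios in the multimodal era.
Our framework first upgrades the vanilla architecture by integrating multiple advanced components.
Building on these improvements, we introduce a progressive training paradigm that combines two core mechanisms: (1) resolution curriculum learning, which transitions from fixed-resolution pretraining to native-resolution tuning and thereby leverages ViT's inherent adaptability to variable-length sequences, and (2) visual modality adaptation via inter-batch image-video switching, which balances computational efficiency with enhanced temporal reasoning.
In parallel, a hybrid training framework combines a sigmoid-based contrastive loss with feature distillation from a frozen teacher model, accelerating early-stage convergence.
Finally, extensive experiments across multiple model scales from 0.3B to 1B, with models trained exclusively on public datasets, demonstrate its effectiveness.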
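
As a rough illustration of the hybrid objective described in the summary, the sketch below combines a pairwise sigmoid contrastive loss with a distillation term against frozen-teacher features. The summary does not give the exact formulation, so the cosine-based distillation term, the `temperature` and `bias` defaults, the `distill_weight` parameter, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: pair (i, i) is a positive, every (i, j != i) a negative.

    temperature and bias are hypothetical fixed values; in practice they
    would typically be learned.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(imgs[i], txts[j]) + bias
            total -= log_sigmoid(label * logit)
    return total / n


def distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between student and frozen-teacher features.

    Assumption: the source does not state which distance is used.
    """
    losses = [
        1.0 - dot(l2_normalize(s), l2_normalize(t))
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)


def hybrid_loss(image_embs, text_embs, teacher_feats, distill_weight=1.0):
    # distill_weight is a hypothetical balancing coefficient.
    contrastive = sigmoid_contrastive_loss(image_embs, text_embs)
    distill = distillation_loss(image_embs, teacher_feats)
    return contrastive + distill_weight * distill
```

With matched image and text embeddings the contrastive term is small, and the distillation term vanishes when the student features point in the same direction as the teacher's, which is one plausible reading of how a teacher signal could speed up early training.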

Summary (original)

Conventional Vision Transformer simplifies visual modeling by standardizing input resolutions, often disregarding the variability of natural visual data and compromising spatial-contextual fidelity. While preliminary explorations have superficially investigated native resolution modeling, existing approaches still lack systematic analysis from a visual representation perspective. To bridge this gap, we introduce UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario in the era of multimodal. Our framework first conducts architectural upgrades to the vanilla paradigm by integrating multiple advanced components. Building upon these improvements, a progressive training paradigm is introduced, which strategically combines two core mechanisms: (1) resolution curriculum learning, transitioning from fixed-resolution pretraining to native resolution tuning, thereby leveraging ViT’s inherent adaptability to variable-length sequences, and (2) visual modality adaptation via inter-batch image-video switching, which balances computational efficiency with enhanced temporal reasoning. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model, thereby accelerating early-stage convergence. Finally, trained exclusively on public datasets, extensive experiments across multiple model scales from 0.3B to 1B demonstrate its effectiveness.
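
To make the "variable-length sequences" point concrete, here is a minimal sketch of how images kept at their native resolution yield different token counts, and how those sequences might be laid out back to back in one packed batch. The patch size of 14, the floor-to-a-multiple-of-patch-size rule, and the function names are assumptions for illustration; the abstract does not specify them.

```python
def num_patches(height, width, patch_size=14):
    """Token count for one image kept at its native resolution.

    Assumption: the image is cropped down to a multiple of patch_size.
    """
    rows = height // patch_size
    cols = width // patch_size
    if rows == 0 or cols == 0:
        raise ValueError("image is smaller than one patch")
    return rows * cols


def pack_sequences(resolutions, patch_size=14):
    """Return per-image token counts and cumulative offsets for a packed batch.

    offsets[k] is where image k's tokens start in the concatenated sequence;
    offsets[-1] is the total number of tokens in the batch.
    """
    lengths = [num_patches(h, w, patch_size) for h, w in resolutions]
    offsets = [0]
    for length in lengths:
        offsets.append(offsets[-1] + length)
    return lengths, offsets


lengths, offsets = pack_sequences([(224, 224), (336, 448)])
print(lengths)  # [256, 768]
print(offsets)  # [0, 256, 1024]
```

A fixed-resolution stage would give every image the same length, while a native-resolution stage produces per-image lengths like those above, so attention must be restricted to each image's own token range.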

arXiv information

Authors	Limeng Qiao, Yiyang Gan, Bairui Wang, Jie Qin, Shuang Xu, Siqi Yang, Lin Ma
Published	2025-04-02 14:59:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
