Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection

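Abstract

Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, in which the model reconstructs masked spatiotemporal tokens from the information in visible tokens.
A key challenge for such approaches, however, is choosing an appropriate masking strategy.
Previous work has explored predefined masking techniques, such as random and tube-based masking, as well as approaches that draw on key motion priors, optical flow, and semantic cues from externally pre-trained models.
In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos.
We also propose a unified training strategy that uses Proximal Policy Optimization (PPO) to jointly optimize both the MAE and TATS from scratch.
Our model allows aggressive masking without compromising performance on the downstream task of action recognition, while keeping pre-training memory efficient.
Extensive experiments on four benchmarks, Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared with other state-of-the-art methods.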
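
The two predefined strategies named in the abstract, random and tube-based masking, can be illustrated with a minimal NumPy sketch. This is not the paper's code: the function names, the 8-frame by 196-patch token grid, and the 0.9 mask ratio are all assumptions chosen for illustration. Random masking hides each spatiotemporal token independently, whereas tube masking draws one spatial mask and repeats it across every frame.

```python
import numpy as np

def random_mask(num_frames, num_patches, mask_ratio, rng):
    """Mask spatiotemporal tokens independently (True = masked)."""
    total = num_frames * num_patches
    num_masked = int(round(mask_ratio * total))
    flat = np.zeros(total, dtype=bool)
    flat[rng.choice(total, size=num_masked, replace=False)] = True
    return flat.reshape(num_frames, num_patches)

def tube_mask(num_frames, num_patches, mask_ratio, rng):
    """Draw one spatial mask and repeat it across all frames."""
    num_masked = int(round(mask_ratio * num_patches))
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return np.tile(spatial, (num_frames, 1))

# Hypothetical token grid: 8 temporal slices, 14 x 14 = 196 patches each.
rng = np.random.default_rng(0)
rand = random_mask(8, 196, 0.9, rng)
tube = tube_mask(8, 196, 0.9, rng)
```

Both masks ignore video content, which is the limitation an adaptive sampler such as TATS is meant to address by selecting motion-centric tokens instead.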

Abstract (original)

Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors, optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also ensuring that the pre-training remains memory efficient. Extensive experiments of the proposed approach across four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared to other state-of-the-art methods.
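
The abstract names Proximal Policy Optimization (PPO) as the optimizer for joint training but does not say how actions, rewards, or advantages are defined for the token sampler. The sketch below therefore shows only the standard PPO clipped surrogate loss in NumPy. The function name, the toy probabilities, and the clipping value of 0.2 are assumptions, not details taken from the paper.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (negated so lower is better)."""
    # Probability ratio between the updated policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum removes the incentive for large updates.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy example with two sampled actions (all values are hypothetical).
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.75, 0.25]))
advantages = np.array([1.0, -1.0])
loss = ppo_clip_loss(logp_new, logp_old, advantages)
```

In a sampler setting, each action would presumably correspond to a token-selection decision, but that mapping is an assumption here.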

arXiv information

Authors	Ayush K. Rai, Kyle Min, Tarun Krishna, Feiyan Hu, Alan F. Smeaton, Noel E. O’Connor
Published	2025-05-13 13:35:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
