Efficient Video Object Segmentation via Modulated Cross-Attention Memory

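Abstract

Transformer-based approaches have recently shown promising results for semi-supervised video object segmentation.
However, they typically struggle on long videos: because they expand the memory bank every few frames, GPU memory demand keeps growing.
We propose MAVOS, a transformer-based approach that introduces an optimized, dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without frequent memory expansion.
The proposed MCA encodes both local and global features at various levels of granularity while maintaining consistent speed regardless of video length.
Extensive experiments on multiple benchmarks (LVOS, Long-Time Video, and DAVIS 2017) demonstrate the effectiveness of the proposed contributions, which yield real-time inference and markedly reduced memory demands without degrading segmentation accuracy on long videos.
Compared to the best existing transformer-based approach, MAVOS increases speed by 7.6x and reduces GPU memory by 87%, with comparable segmentation performance on short and long video datasets.
Notably, on the LVOS dataset, MAVOS achieves a J&F score of 63.3% while running at 37 frames per second (FPS) on a single V100 GPU.
The code and models will be publicly available at https://github.com/Amshaker/MAVOS.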
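
To make the memory argument above concrete, here is a minimal sketch. It is not the MAVOS implementation. It only counts memory entries: a bank that appends one entry every few frames grows linearly with video length, while a memory with a fixed capacity does not. The function names, the expansion interval of 5 frames, and the capacity of 2 entries are hypothetical values chosen for illustration; none of them come from the paper.

```python
def expanding_bank_size(num_frames: int, interval: int) -> int:
    """Entries in a bank that stores the first frame, then appends one
    entry every `interval` frames (hypothetical expansion policy)."""
    if num_frames <= 0:
        return 0
    return 1 + (num_frames - 1) // interval


def fixed_memory_size(num_frames: int, capacity: int) -> int:
    """Entries in a memory that never holds more than `capacity` entries
    (hypothetical stand-in for a memory that is updated, not expanded)."""
    return min(max(num_frames, 0), capacity)


if __name__ == "__main__":
    # Assumed values for illustration only: interval=5, capacity=2.
    for frames in (100, 1_000, 10_000):
        grown = expanding_bank_size(frames, interval=5)
        fixed = fixed_memory_size(frames, capacity=2)
        print(f"{frames:>6} frames: expanding bank={grown:>5}, fixed memory={fixed}")
```

Under these assumptions the expanding bank scales with the number of frames, which is the behaviour the abstract identifies as the cause of growing GPU memory demand on long videos.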

Abstract (original)

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.

arXiv information

Authors	Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Published	2024-03-26 17:59:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
