Efficient Video Object Segmentation via Modulated Cross-Attention Memory

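Summary

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, because these approaches typically expand the memory bank every few frames, their GPU memory requirements grow and they struggle on long videos. We propose a transformer-based approach, named MAVOS, that models temporal smoothness without frequent memory expansion by introducing an optimized, dynamic long-term modulated cross-attention (MCA) memory. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining a consistent speed regardless of video length. Extensive experiments on multiple benchmarks (LVOS, Long-Time Video, and DAVIS 2017) demonstrate the effectiveness of the proposed contributions, which lead to real-time inference and markedly lower memory demands without degrading segmentation accuracy on long videos. Compared with the best existing transformer-based approach, MAVOS is 7.6x faster and reduces GPU memory by 87%, while achieving comparable segmentation performance on short and long video datasets. Notably, on the LVOS dataset, MAVOS achieves a J&F score of 63.3% while running at 37 frames per second (FPS) on a single V100 GPU. The code and models will be made publicly available at https://github.com/Amshaker/MAVOS.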

Abstract (original)

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
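The scaling argument in the abstract (a memory bank that is expanded every few frames grows with video length, while a memory of fixed size does not) can be illustrated with a toy count of stored frames. This is only a sketch and not the MAVOS implementation: the function names, the expansion interval of 5 frames, and the fixed capacity of 2 entries are hypothetical values chosen for illustration.

```python
# Toy illustration only: counts how many frames a memory holds.
# It does not model attention, features, or the actual MCA memory.


def growing_bank_size(num_frames: int, interval: int) -> int:
    """Entries in a bank that stores the first frame and then one more
    frame every `interval` frames (hypothetical expansion policy)."""
    if num_frames <= 0:
        return 0
    return 1 + (num_frames - 1) // interval


def fixed_bank_size(num_frames: int, capacity: int) -> int:
    """Entries in a memory that never holds more than `capacity` frames."""
    if num_frames <= 0:
        return 0
    return min(capacity, num_frames)


if __name__ == "__main__":
    INTERVAL = 5   # assumed expansion interval, not taken from the paper
    CAPACITY = 2   # assumed fixed memory size, not taken from the paper
    for length in (100, 1_000, 10_000):
        grown = growing_bank_size(length, INTERVAL)
        fixed = fixed_bank_size(length, CAPACITY)
        print(f"{length:>6} frames: growing bank = {grown:>5}, fixed memory = {fixed}")
```

Under these assumptions the growing bank scales linearly with the number of frames, which is why per-frame cost and GPU memory rise on long videos, whereas the fixed memory stays constant regardless of length.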

arXiv information

Authors	Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Published	2024-09-02 20:58:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
