GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation

Summary

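Recent work in Video Instance Segmentation (VIS) relies increasingly on online methods to model complex and lengthy video sequences.
However, online methods suffer from degraded representations and accumulated noise, especially during occlusion and abrupt changes, which poses substantial challenges.
Transformer-based query propagation offers a promising direction, at the cost of quadratic memory attention.
These methods are nevertheless susceptible to degraded instance features caused by the challenges above, and the resulting errors cascade.
Detecting and rectifying such errors remains largely underexplored.
To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation.
First, we use a Gumbel-Softmax-based gate to detect possible errors in the current frame.
Next, based on the gate activation, we rectify degraded features from their past representation.
This residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features.
Second, we propose a novel inter-instance interaction that uses the gate activation as a mask for self-attention.
This masking strategy dynamically restricts unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking.
We refer to this combination of gated residual connection and masked self-attention as the GRAtt block, which can easily be integrated into existing propagation-based frameworks.
Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling.
GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods.
Code is available at https://github.com/Tanveer81/GRAttVIS.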
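The gating and rectification steps described in this summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `gumbel_softmax_gate` and `gated_residual_update`, the two-class logit layout, and the convention that a gate of 1 keeps the current feature while 0 falls back to the previous frame are all assumptions made for illustration. Only the forward pass is shown; a trainable version would need a differentiable relaxation such as a straight-through estimator.

```python
import numpy as np


def gumbel_softmax_gate(logits, rng, tau=1.0):
    """Sample a hard 0/1 gate per instance from two-class logits.

    logits: array of shape (num_instances, 2). By assumption, column 1 is
    the "current feature is reliable" class and column 0 the "degraded" class.
    """
    # Gumbel(0, 1) noise via the inverse-CDF trick; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed logits.
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    y = y / y.sum(axis=-1, keepdims=True)
    # Hard decision: 1.0 where the "reliable" class wins, else 0.0.
    return (y.argmax(axis=-1) == 1).astype(logits.dtype)


def gated_residual_update(prev_queries, curr_queries, gate):
    """Keep the current query where gate == 1, reuse the previous one where gate == 0."""
    g = gate[:, None]
    return g * curr_queries + (1.0 - g) * prev_queries


rng = np.random.default_rng(0)
logits = np.array([[-100.0, 100.0],   # instance 0: clearly reliable
                   [100.0, -100.0]])  # instance 1: clearly degraded
gate = gumbel_softmax_gate(logits, rng)
prev_queries = np.zeros((2, 3))
curr_queries = np.ones((2, 3))
rectified = gated_residual_update(prev_queries, curr_queries, gate)
```

Because a degraded instance simply reuses the feature carried over from the previous frame, this kind of update needs no separate memory bank, which matches the motivation given in the summary.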

Summary (original)

Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation of the online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions at the cost of quadratic memory attention. However, they are susceptible to the degradation of instance features due to the above-mentioned challenges and suffer from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. Firstly, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Next, based on the gate activation, we rectify degraded features from its past representation. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Secondly, we propose a novel inter-instance interaction using gate activation as a mask for self-attention. This masking strategy dynamically restricts the unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking. We refer to this novel combination of Gated Residual Connection and Masked Self-Attention as GRAtt block, which can easily be integrated into the existing propagation-based framework. Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods. Code is available at https://github.com/Tanveer81/GRAttVIS.
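The idea of using the gate activation as a self-attention mask can be sketched as follows. This is a simplified, single-head NumPy illustration under stated assumptions, not the released code: the function name `gate_masked_self_attention` is hypothetical, learned query/key/value projections are omitted, gated-off instances are masked out as keys, and each instance is always allowed to attend to itself so that no attention row is empty. The abstract does not specify these details.

```python
import numpy as np


def gate_masked_self_attention(queries, gate):
    """Self-attention over instance queries, hiding gated-off instances as keys.

    queries: (num_instances, dim) instance features.
    gate:    (num_instances,) with 1 = representative, 0 = unrepresentative.
    Returns the attended features and the attention weights.
    """
    num_instances, dim = queries.shape
    # Scaled dot-product scores between every pair of instances.
    scores = queries @ queries.T / np.sqrt(dim)
    # Assumption: a key is visible if its gate is 1, or if it is the query itself.
    visible = gate.astype(bool)[None, :] | np.eye(num_instances, dtype=bool)
    scores = np.where(visible, scores, -np.inf)
    # Row-wise softmax; masked entries receive exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ queries, weights


queries = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
gate = np.array([1.0, 0.0, 1.0])  # instance 1 is flagged as unrepresentative
attended, weights = gate_masked_self_attention(queries, gate)
```

In this sketch the flagged instance contributes nothing to the other instances' updated features, so a degraded query cannot contaminate its neighbours, while it can still read from the reliable ones.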

arXiv information

Authors	Tanveer Hannan, Rajat Koner, Maximilian Bernhard, Suprosanna Shit, Bjoern Menze, Volker Tresp, Matthias Schubert, Thomas Seidl
Published	2023-05-26 17:10:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
