Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Summary

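Transformers have achieved state-of-the-art performance on the inverse problem of video Snapshot Compressive Imaging (SCI). The ill-posedness of SCI stems from a mixed degradation of spatial masking and temporal aliasing.
However, previous Transformers lack insight into this degradation, which limits both their performance and their efficiency.
This work tailors an efficient reconstruction architecture that performs no temporal aggregation in its early layers, with the Hierarchical Separable Video Transformer (HiSViT) as its building block.
HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections. Each group operates on a separate portion of the channels at a different scale, enabling multi-scale interactions and long-range modeling.
By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias toward attending more within frames than between frames, while reducing computational overhead.
GSM-FFN is designed to enhance locality through a gating mechanism and factorized spatial-temporal convolutions.
Extensive experiments show that the method outperforms previous methods by $>\!0.5$ dB with comparable or lower complexity and a comparable or smaller parameter count.
The source code and pretrained models are released at https://github.com/pwangcs/HiSViT.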
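
The mixed degradation named in the summary can be made concrete with the standard video SCI measurement model, in which each frame is multiplied by its own mask and the masked frames are summed into a single 2-D snapshot. The sketch below is only an illustration and is not taken from the HiSViT repository: the function name `sci_measurement`, the binary masks, the tiny frame sizes, and the absence of measurement noise are all assumptions made here.

```python
import random


def sci_measurement(frames, masks):
    """Collapse T masked frames into one snapshot: Y = sum_t M_t * X_t.

    `frames` and `masks` are lists of T images, each a list of rows.
    Noise is omitted (an assumption of this sketch).
    """
    height, width = len(frames[0]), len(frames[0][0])
    snapshot = [[0.0] * width for _ in range(height)]
    for frame, mask in zip(frames, masks):
        for i in range(height):
            for j in range(width):
                # Spatial masking: the frame is modulated by its own mask.
                # Temporal aliasing: every masked frame adds into the same pixel.
                snapshot[i][j] += mask[i][j] * frame[i][j]
    return snapshot


# Hypothetical sizes, chosen only to keep the example small.
T, H, W = 8, 4, 4
rng = random.Random(0)
frames = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
masks = [[[rng.randint(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(T)]

snapshot = sci_measurement(frames, masks)
print(len(frames), "frames ->", len(snapshot), "x", len(snapshot[0]), "snapshot")
```

Reconstruction has to recover all `T` frames from this one snapshot, which is why the problem is ill-posed.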

Abstract (original)

Transformers have achieved the state-of-the-art performance on solving the inverse problem of Snapshot Compressive Imaging (SCI) for video, whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack an insight into the degradation and thus have limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture without temporal aggregation in early layers and Hierarchical Separable Video Transformer (HiSViT) as building block. HiSViT is built by multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each of which is conducted within a separate channel portions at a different scale, for multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of paying more attention within frames instead of between frames while saving computational overheads. GSM-FFN is design to enhance the locality via gated mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by $>\!0.5$ dB with comparable or fewer complexity and parameters. The source codes and pretrained models are released at https://github.com/pwangcs/HiSViT.
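
The abstract's claim that separating spatial from temporal operations saves computation can be illustrated by counting query-key pairs. The sketch below compares joint space-time attention with a generic factorized scheme (spatial attention inside each frame, then temporal attention at each pixel location). It is not CSS-MSA itself, whose cross-scale design is not detailed in the abstract; the function names and the feature-map size are assumptions made for illustration.

```python
def joint_attention_pairs(t, h, w):
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = t * h * w
    return tokens * tokens


def separable_attention_pairs(t, h, w):
    """Query-key pairs for spatial attention per frame plus temporal attention per pixel."""
    spatial = t * (h * w) ** 2   # each of the t frames attends within itself
    temporal = (h * w) * t ** 2  # each pixel location attends across the t frames
    return spatial + temporal


# Hypothetical feature-map size: 8 frames of 16 x 16 tokens.
t, h, w = 8, 16, 16
joint = joint_attention_pairs(t, h, w)
separable = separable_attention_pairs(t, h, w)
print(f"joint: {joint}, separable: {separable}, ratio: {joint / separable:.2f}")
```

In this count the spatial term dominates the separable cost, which matches the stated bias of spending more attention within frames than between them.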

arXiv information

Authors	Ping Wang, Yulun Zhang, Lishun Wang, Xin Yuan
Published	2024-07-16 17:35:59+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
