Multimodal Token Fusion for Vision Transformers

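Summary

Many adaptations of transformers have emerged for single-modal vision tasks, in which self-attention modules are stacked to process input sources such as images.
Intuitively, feeding data from multiple modalities to a vision transformer could improve performance, but it may also dilute the inner-modal attention weights and thereby hurt the final result.
This paper proposes a multimodal token fusion method (TokenFusion) tailored to transformer-based vision tasks.
To fuse multiple modalities effectively, TokenFusion dynamically detects uninformative tokens and replaces them with projected and aggregated inter-modal features.
Residual positional alignment is also adopted so that inter-modal alignments can be used explicitly after fusion.
This design lets the transformer learn correlations among multimodal features while leaving the single-modal transformer architecture largely intact.
Extensive experiments on a variety of homogeneous and heterogeneous modalities show that TokenFusion surpasses state-of-the-art methods on three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images.
The code is available at https://github.com/yikaiw/TokenFusion.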

Abstract (original)

Many adaptations of transformers have emerged to address the single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve the performance, yet the inner-modal attentive weights may also be diluted, which could thus undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Our code is available at https://github.com/yikaiw/TokenFusion.
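
The substitution step described in the abstract (detect uninformative tokens, then replace them with projected features from another modality) can be illustrated with a small sketch. This is not the authors' implementation: the function names `fuse_tokens` and `linear_projection`, the externally supplied importance scores, the fixed threshold of 0.5, and the one-to-one positional pairing of tokens are all assumptions made for illustration. Aggregation over more than two modalities and residual positional alignment are left out.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def linear_projection(weights: Sequence[Sequence[float]]) -> Callable[[Vector], Vector]:
    """Return a function that multiplies a token vector by a fixed weight matrix.

    Hypothetical stand-in for a learned inter-modal projection.
    """
    def project(token: Vector) -> Vector:
        return [sum(w * x for w, x in zip(row, token)) for row in weights]
    return project


def fuse_tokens(
    tokens_a: Sequence[Vector],
    scores_a: Sequence[float],
    tokens_b: Sequence[Vector],
    project_b_to_a: Callable[[Vector], Vector],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Vector]:
    """Replace low-scoring tokens of modality A with projected tokens of modality B.

    Assumes both modalities have the same number of tokens and that the token
    at index i in A corresponds to the token at index i in B.
    """
    if not (len(tokens_a) == len(scores_a) == len(tokens_b)):
        raise ValueError("token and score sequences must have equal length")

    fused: List[Vector] = []
    for token_a, score, token_b in zip(tokens_a, scores_a, tokens_b):
        if score >= threshold:
            # Informative token: keep the original single-modal feature.
            fused.append(list(token_a))
        else:
            # Uninformative token: substitute the projected inter-modal feature.
            fused.append(project_b_to_a(token_b))
    return fused


# Toy example with three 2-dimensional tokens per modality.
rgb_tokens = [[1.0, 0.0], [0.2, 0.1], [0.0, 3.0]]
rgb_scores = [0.9, 0.1, 0.8]  # the second token is treated as uninformative
depth_tokens = [[5.0, 5.0], [2.0, 4.0], [7.0, 7.0]]
halve = linear_projection([[0.5, 0.0], [0.0, 0.5]])

fused_rgb = fuse_tokens(rgb_tokens, rgb_scores, depth_tokens, halve)
print(fused_rgb)  # [[1.0, 0.0], [1.0, 2.0], [0.0, 3.0]]
```

Only the second RGB token falls below the threshold, so only that position is overwritten. The rest of the sequence is untouched, which mirrors the abstract's point that the single-modal architecture stays largely intact.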

arXiv information

Authors	Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang
Published	2022-07-15 11:00:23+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
