Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs

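Abstract

Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiment, and existing models have made significant progress in it. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they serve as the dominant paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, the authors propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and they introduce a graph-structured representation pattern for MulTs. Based on this pattern, they propose an Interlaced Mask (IM) mechanism and use it to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). GsiT is formally equivalent to MulTs. Through IM it achieves efficient weight sharing without information disorder, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented so that this design incurs no additional computational overhead. GsiT also achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT and of the HMHG concept, the authors integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reductions on widely used MSA datasets.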
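The general idea of sharing one set of attention weights across modalities while masking which modality may attend to which can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Interlaced Mask or its Triton kernel: the helper names `block_mask` and `masked_attention`, the modality lengths, and the cyclic pattern of allowed (query, key) modality pairs are all hypothetical choices made for the example.

```python
import numpy as np

def block_mask(lengths, allowed_pairs):
    """Boolean mask over the concatenated sequence; True = query may attend to key.

    lengths: number of tokens per modality, in concatenation order.
    allowed_pairs: (query_modality, key_modality) index pairs (illustrative only).
    """
    starts = np.cumsum([0] + list(lengths))
    total = int(starts[-1])
    mask = np.zeros((total, total), dtype=bool)
    for q, k in allowed_pairs:
        mask[starts[q]:starts[q + 1], starts[k]:starts[k + 1]] = True
    return mask

def masked_attention(x, w_q, w_k, w_v, mask):
    """Single-head attention with one shared set of weights for all modalities.

    Assumes every query row has at least one allowed key, otherwise the
    softmax below would be undefined for that row.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed edges
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical setup: text (4 tokens), vision (3 tokens), audio (2 tokens).
lengths = (4, 3, 2)
# Assumed cyclic pattern: text attends to vision, vision to audio, audio to text.
forward_pairs = [(0, 1), (1, 2), (2, 0)]
mask = block_mask(lengths, forward_pairs)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((sum(lengths), d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, weights = masked_attention(x, w_q, w_k, w_v, mask)
```

In this sketch the three cross-modal attention patterns run through a single `w_q`, `w_k`, `w_v` triple, and the mask keeps tokens from attending within their own modality or along pairs that are not listed.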

Abstract (original)

Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although act as the paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). It is formally equivalent to MulTs which achieves an efficient weight-sharing mechanism without information disorder through IM, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented to ensure avoiding additional computational overhead. Moreover, it achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets.

arXiv information

Authors	Yijie Jin, Junjie Peng, Xuanchao Lin, Haochen Yuan, Lan Wang, Cangzhi Zheng
Published	2025-05-02 07:18:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
