GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis

Summary

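Multimodal sentiment analysis (MSA) analyzes human sentiment using multiple data modalities.
Existing MSA models generally adopt state-of-the-art multimodal fusion and representation-learning methods to improve MSA performance.
However, two key challenges remain. (i) In existing multimodal fusion methods, the decoupling of modality combinations and heavy parameter redundancy lead to insufficient fusion performance and efficiency.
(ii) Unimodal feature extractors and encoders face a difficult trade-off between representation capability and computational overhead.
The proposed GSIFN addresses these problems with two main components. (i) A graph-structured and interlaced-masked multimodal Transformer.
It adopts the Interlaced Mask mechanism to construct robust multimodal graph embeddings, achieve all-modal-in-one Transformer-based fusion, and greatly reduce computational overhead.
(ii) A self-supervised learning framework with low computational overhead and high performance. It uses a parallelized LSTM with matrix memory to enhance non-verbal modal features for unimodal label generation.
Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN outperforms previous state-of-the-art models at significantly lower computational overhead.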
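
The summary does not spell out how the Interlaced Mask is built, so the following is only a rough sketch of the general idea behind masked, all-modal-in-one attention: tokens from every modality are concatenated into one sequence, and a boolean mask derived from a modality-level graph decides which tokens may attend to which. The function name `build_modality_mask`, the toy sequence lengths, and the choice of cross-modal-only edges are all assumptions made for illustration and should not be read as the GSIFN mask itself.

```python
from typing import Dict, List


def build_modality_mask(
    lengths: Dict[str, int],
    edges: Dict[str, List[str]],
) -> List[List[bool]]:
    """Build an attention mask over a concatenated multimodal sequence.

    lengths: number of tokens per modality, in concatenation order.
    edges:   modality-level directed graph; edges[q] lists the modalities
             whose tokens a query token of modality q may attend to.
    Returns mask[i][j] == True when query token i may attend to key token j.
    """
    # Label every position of the concatenated sequence with its modality.
    owner = [name for name, n in lengths.items() for _ in range(n)]
    return [[key in edges[query] for key in owner] for query in owner]


# Assumed toy setup: 2 text tokens, 1 vision token, 1 audio token.
lengths = {"text": 2, "vision": 1, "audio": 1}
# Assumed graph: each modality attends only to the other two modalities.
edges = {
    "text": ["vision", "audio"],
    "vision": ["text", "audio"],
    "audio": ["text", "vision"],
}

mask = build_modality_mask(lengths, edges)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 0011
# 0011
# 1101
# 1110
```

Because a single mask covers all modality pairs, one Transformer can process the whole concatenated sequence instead of one cross-modal Transformer per pair. Mask conventions differ between libraries, so the boolean matrix may need inverting before it is passed to an attention layer.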

Summary (original)

Multimodal Sentiment Analysis (MSA) leverages multiple data modals to analyze human sentiment. Existing MSA models generally employ cutting-edge multimodal fusion and representation learning-based methods to promote MSA capability. However, there are two key challenges: (i) in existing multimodal fusion methods, the decoupling of modal combinations and tremendous parameter redundancy, lead to insufficient fusion performance and efficiency; (ii) a challenging trade-off exists between representation capability and computational overhead in unimodal feature extractors and encoders. Our proposed GSIFN incorporates two main components to solve these problems: (i) a graph-structured and interlaced-masked multimodal Transformer. It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computational overhead; (ii) a self-supervised learning framework with low computational overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal features for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computational overhead compared with previous state-of-the-art models.

arXiv information

Author	Yijie Jin
Published	2024-09-12 16:11:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
