InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

Summary

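Sarcasm conveyed through text-image combinations is prevalent on social media and poses a significant challenge for sentiment analysis and intention mining.
Existing multi-modal sarcasm detection methods have been shown to overestimate their performance, because they struggle to capture the intricate sarcastic cues that arise from the interaction between an image and its text.
To address these issues, we propose InterCLIP-MEP, a novel framework for multi-modal sarcasm detection.
Specifically, we introduce Interactive CLIP (InterCLIP) as the backbone for extracting text and image representations. It embeds cross-modality information directly within each encoder, which improves the representations so that they better capture text-image interactions.
In addition, an efficient training strategy is designed to adapt InterCLIP to our proposed Memory-Enhanced Predictor (MEP).
MEP uses a dynamic, fixed-length dual-channel memory to store historical knowledge of valuable test samples during inference.
It then uses this memory as a non-parametric classifier to derive the final prediction, giving more robust recognition of multi-modal sarcasm.
Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark, improving accuracy by 1.08% and F1 score by 1.51% over the previous best method.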
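
The summary says InterCLIP embeds cross-modality information directly within each encoder but does not spell out the mechanism. One plausible reading, shown here only as a minimal NumPy sketch, is a self-attention layer whose keys and values are extended with projected tokens from the other modality. The function name `conditioned_self_attention`, the single-head layout, and the projection matrix `w_proj` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def conditioned_self_attention(x, other, w_q, w_k, w_v, w_proj):
    """Single-head attention over one modality, conditioned on the other.

    x:      (n, d) tokens of the modality being encoded.
    other:  (m, d_other) tokens produced by the other modality's encoder.
    w_proj: (d_other, d) hypothetical projection into this encoder's space.
    Returns the attended tokens (n, d) and the attention weights (n, n + m).
    """
    cond = other @ w_proj                       # (m, d) projected cross-modal tokens
    q = x @ w_q                                 # queries come from this modality only
    kv_in = np.concatenate([x, cond], axis=0)   # keys/values see both modalities
    k = kv_in @ w_k
    v = kv_in @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn
```

In this sketch the output keeps the shape of `x`, so such a layer could stand in for an ordinary self-attention layer; which layers are conditioned, and in which direction, is left open here.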

Summary (original)

The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection methods have been proven to overestimate performance, as they struggle to effectively capture the intricate sarcastic cues that arise from the interaction between an image and text. To address these issues, we propose InterCLIP-MEP, a novel framework for multi-modal sarcasm detection. Specifically, we introduce an Interactive CLIP (InterCLIP) as the backbone to extract text-image representations, enhancing them by embedding cross-modality information directly within each encoder, thereby improving the representations to capture text-image interactions better. Furthermore, an efficient training strategy is designed to adapt InterCLIP for our proposed Memory-Enhanced Predictor (MEP). MEP uses a dynamic, fixed-length dual-channel memory to store historical knowledge of valuable test samples during inference. It then leverages this memory as a non-parametric classifier to derive the final prediction, offering a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark, with an accuracy improvement of 1.08% and an F1 score improvement of 1.51% over the previous best method.
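
The Memory-Enhanced Predictor described above can be pictured with a small sketch. The abstract only states that the memory is dynamic, fixed-length, dual-channel, filled with valuable test samples during inference, and used as a non-parametric classifier. Everything else below is an assumption made for illustration: one channel per class, prediction entropy as the measure of how valuable a sample is, replacement of the highest-entropy entry when a channel is full, and mean cosine similarity for the final decision. The class name `DualChannelMemory` is hypothetical.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy of a probability vector (lower means more confident)."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())


def unit(vec):
    """Return the vector scaled to unit length."""
    v = np.asarray(vec, dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)


class DualChannelMemory:
    """Two fixed-length channels: index 0 = non-sarcastic, 1 = sarcastic (assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.features = [[], []]   # stored unit-norm features per channel
        self.entropies = [[], []]  # entropy of the prediction that stored them

    def update(self, feature, probs):
        """Store a test sample in the channel of its pseudo-label."""
        channel = int(np.argmax(probs))
        h = entropy(probs)
        feats, ents = self.features[channel], self.entropies[channel]
        if len(feats) < self.capacity:
            feats.append(unit(feature))
            ents.append(h)
        else:
            worst = int(np.argmax(ents))  # least confident stored sample
            if h < ents[worst]:           # keep only the more confident one
                feats[worst] = unit(feature)
                ents[worst] = h
        return channel

    def predict(self, feature):
        """Non-parametric decision: channel with highest mean cosine similarity."""
        f = unit(feature)
        scores = []
        for feats in self.features:
            if feats:
                scores.append(float(np.mean(np.stack(feats) @ f)))
            else:
                scores.append(float("-inf"))  # an empty channel cannot win
        return int(np.argmax(scores))
```

A typical loop under these assumptions would call `update(feature, probs)` for each test sample using the classifier's output, then `predict(feature)` to obtain the memory-based label. If both channels are still empty, `predict` falls back to class 0.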

arXiv information

Authors	Junjie Chen, Hang Yu, Weidong Liu, Subin Huang, Sanmin Liu
Published	2024-08-13 09:52:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
