UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

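Summary

Visual emotion analysis has significant research value in both computer vision and psychology.
However, existing visual emotion analysis methods generalize poorly because emotion perception is ambiguous and data scenarios are diverse.
To address this, the authors introduce UniEmoX, a cross-modal, semantic-guided, large-scale pretraining framework.
Inspired by psychological research emphasizing that the process of emotional exploration cannot be separated from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations.
By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively.
To the best of the authors' knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios.
In addition, the authors develop a visual emotion dataset named Emo8.
Emo8 samples span a range of domains, including cartoon, natural, realistic, science fiction, and advertising cover styles, and cover nearly all common emotional scenes.
Comprehensive experiments on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX.
The source code is available at https://github.com/chincharles/u-emo.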

Abstract (original)

Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo.
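The abstract does not give the form of the distillation objective, so the following is only a minimal sketch of one generic way to distill image-text similarity structure from a teacher such as CLIP. It assumes the teacher supplies image and text embeddings for a batch, that the diagonal entries of the image-text similarity matrix correspond to paired samples and the off-diagonal entries to unpaired ones, and that the student is trained to match the teacher's row-wise softmax over those similarities with a KL divergence. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the UniEmoX code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(row, temperature):
    """Numerically stable softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(image_embs, text_embs, temperature=0.1):
    """For each image, a distribution over all texts in the batch.

    Entry [i][i] is the paired sample; entries [i][j], j != i, are unpaired.
    """
    return [
        softmax([cosine(img, txt) for txt in text_embs], temperature)
        for img in image_embs
    ]


def similarity_distillation_loss(student_image_embs, teacher_image_embs,
                                 teacher_text_embs, temperature=0.1):
    """Mean KL(teacher || student) over rows of the similarity distributions.

    Assumption: this is a generic similarity-distillation loss, not the
    objective used in the paper.
    """
    teacher = similarity_distribution(teacher_image_embs, teacher_text_embs, temperature)
    student = similarity_distribution(student_image_embs, teacher_text_embs, temperature)
    kl_rows = [
        sum(t * math.log(t / s) for t, s in zip(t_row, s_row) if t > 0)
        for t_row, s_row in zip(teacher, student)
    ]
    return sum(kl_rows) / len(kl_rows)


# Toy 2-D embeddings standing in for real encoder outputs (hypothetical values).
teacher_text = [[1.0, 0.0], [0.0, 1.0]]
teacher_image = [[0.9, 0.1], [0.2, 0.8]]
student_mismatched = [[0.1, 0.9], [0.8, 0.2]]

print(similarity_distillation_loss(teacher_image, teacher_image, teacher_text))
print(similarity_distillation_loss(student_mismatched, teacher_image, teacher_text))
```

In this sketch a student whose embeddings reproduce the teacher's similarity pattern incurs zero loss, while a student that ranks the unpaired text above the paired one is penalized. A real setup would obtain the embeddings from the CLIP encoders and the student backbone rather than from hand-written vectors.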

arXiv information

Authors	Chuang Chen,Xiao Sun,Zhi Liu
Published	2024-09-27 16:12:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
