Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting

Summary

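Understanding the emotions in a dialogue usually requires external knowledge to interpret the content accurately.
As LLMs grow more powerful, we do not want to settle for the limited ability of pre-trained language models.
However, LLMs either process only the text modality or are too expensive to apply to multimedia information.
We aim to use both the power of LLMs and the supplementary features from the multimedia modalities.
In this paper, we present Lantern, a framework that improves the performance of a given vanilla model by prompting large language models with receptive-field-aware attention weighting.
The framework trains a multi-task vanilla model to produce emotion-class probabilities and dimension scores.
These predictions are fed to the LLM as references, and the LLM adjusts the predicted probability of each emotion class using its external knowledge and contextual understanding.
The dialogue is sliced into different receptive fields such that each sample is included in exactly t receptive fields.
Finally, the LLM predictions are merged by a receptive-field-aware, attention-driven weighting module.
In the experiments, the vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B.
Experiments on IEMOCAP in the 4-way and 6-way settings show that Lantern can significantly improve the performance of current vanilla models, by up to 1.23% and 1.80%.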
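
The abstract states only that each sample falls in exactly t receptive fields and that the per-field LLM predictions are merged by a weighting module; it does not give the slicing rule or the form of the weights. The sketch below is one possible reading, not the authors' implementation. The assumptions are: windows of length `t * stride` shifted by `stride` and clipped to the dialogue, and a plain softmax over caller-supplied scores standing in for the learned attention weights. The function names `slice_receptive_fields` and `merge_predictions` are hypothetical.

```python
import math
from typing import List, Sequence


def slice_receptive_fields(n: int, stride: int, t: int) -> List[range]:
    """Return index windows so that every utterance 0..n-1 lies in exactly t windows.

    Assumed scheme (not from the paper): windows of length t*stride, shifted by
    `stride`, allowed to overhang the dialogue and then clipped to [0, n).
    """
    if n <= 0 or stride <= 0 or t <= 0:
        raise ValueError("n, stride and t must be positive")
    fields = []
    last_block = (n - 1) // stride
    # Start t-1 blocks before the dialogue so the first utterances also get t windows.
    for k in range(-(t - 1), last_block + 1):
        start = max(0, k * stride)
        stop = min(n, (k + t) * stride)
        fields.append(range(start, stop))
    return fields


def merge_predictions(per_field_probs: Sequence[Sequence[float]],
                      scores: Sequence[float]) -> List[float]:
    """Merge the class distributions one utterance received from its t fields.

    `scores` holds one unnormalised score per field. A softmax turns them into
    weights; this is a stand-in for the learned attention module (assumption).
    """
    if not scores or len(per_field_probs) != len(scores):
        raise ValueError("need one score per field prediction")
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    n_classes = len(per_field_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, per_field_probs))
            for c in range(n_classes)]
```

With equal scores, `merge_predictions` reduces to a simple average of the per-field distributions; unequal scores shift the result toward the fields that are scored higher.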

Abstract (original)

Understanding the emotions in a dialogue usually requires external knowledge to accurately understand the contents. As the LLMs become more and more powerful, we do not want to settle on the limited ability of the pre-trained language model. However, the LLMs either can only process text modality or are too expensive to process the multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a certain vanilla model by prompting large language models with receptive-field-aware attention weighting. This framework trained a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with its external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. The experiments in IEMOCAP with 4-way and 6-way settings demonstrated that the Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.

arXiv information

Authors	Liyun Zhang, Dian Ding, Yu Lu, Yi-Chao Chen, Guangtao Xue
Published	2024-11-26 18:35:24+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
