Prompting Visual-Language Models for Dynamic Facial Expression Recognition

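Summary

This paper presents DFER-CLIP, a novel visual-language model based on CLIP and designed for in-the-wild dynamic facial expression recognition (DFER).
Specifically, DFER-CLIP consists of a visual part and a textual part.
In the visual part, a temporal model consisting of several Transformer encoders is added on top of the CLIP image encoder to extract temporal facial expression features, and the final feature embedding is obtained from a learnable "class" token.
The textual part takes as input textual descriptions of the facial behaviour associated with the classes (facial expressions) to be recognised; these descriptions are generated with large language models such as ChatGPT.
Compared with approaches that use only the class names, this captures the relationship between the classes more accurately.
Alongside the textual descriptions, a learnable token is introduced that helps the model learn relevant context information for each expression during training.
Extensive experiments demonstrate the effectiveness of the method and show that DFER-CLIP achieves state-of-the-art results compared with current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.
Code is publicly available at https://github.com/zengqunzhao/DFER-CLIP.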
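
The abstract does not spell out how the visual and textual parts are combined at prediction time. The sketch below shows the standard CLIP-style matching step, which is assumed here and not taken from the authors' code: a video embedding is compared with one text embedding per expression class by cosine similarity, and a softmax turns the scaled similarities into class probabilities. The three-dimensional vectors, the class names, the function names, and the temperature of 0.01 are all illustrative placeholders standing in for the outputs of the temporal model and the text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(video_feat, text_feats, temperature=0.01):
    """Return one temperature-scaled cosine similarity per class.

    video_feat: embedding of the whole clip (hypothetical stand-in for the
                learnable "class" token produced by the temporal model).
    text_feats: dict mapping class name -> text embedding (hypothetical
                stand-in for the encoded description plus learnable context).
    temperature: illustrative value, not taken from the paper.
    """
    v = l2_normalize(video_feat)
    logits = {}
    for name, feat in text_feats.items():
        t = l2_normalize(feat)
        logits[name] = sum(a * b for a, b in zip(v, t)) / temperature
    return logits


def softmax(logits):
    """Numerically stable softmax over a dict of logits."""
    peak = max(logits.values())
    exps = {k: math.exp(x - peak) for k, x in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Toy embeddings; real ones would come from the visual and textual encoders.
video_feat = [0.9, 0.1, 0.0]
text_feats = {
    "happiness": [1.0, 0.0, 0.0],
    "sadness": [0.0, 1.0, 0.0],
    "surprise": [0.0, 0.0, 1.0],
}

probs = softmax(cosine_logits(video_feat, text_feats))
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Because both sides are normalised before the dot product, only the direction of each embedding matters, which is why richer text descriptions can change the prediction by moving the class embeddings without any change to the visual side.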

Abstract (original)

This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable ‘class’ token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising — those descriptions are generated using large language models, like ChatGPT. This, in contrast to works that use only the class names and more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP also achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks. Code is publicly available at https://github.com/zengqunzhao/DFER-CLIP.

arXiv information

Authors	Zengqun Zhao, Ioannis Patras
Published	2024-11-26 16:37:36+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
