Towards End-to-End Explainable Facial Action Unit Recognition via Vision-Language Joint Learning

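Abstract

Facial action units (FAUs), defined in the Facial Action Coding System (FACS), have attracted considerable research interest because of their wide range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation: they focus only on AU recognition accuracy and overlook explanations of the corresponding AU states. This paper proposes an end-to-end vision-language joint learning network for explainable FAU recognition, called VL-FAU. Specifically, VL-FAU integrates language models to generate fine-grained local muscle descriptions and a distinguishable global face description while optimising FAU recognition. As a result, the global facial representation and its local AU representations become more discriminative across different AUs and different subjects. In addition, multi-level AU representation learning is used to improve each AU's attention-aware representation capability, based on multi-scale combined facial stem features. Extensive experiments on the DISFA and BP4D AU datasets show that the proposed method outperforms state-of-the-art methods on most metrics. Moreover, unlike mainstream FAU recognition methods, VL-FAU can provide interpretable local- and global-level language descriptions alongside its AU predictions.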

Abstract (original)

Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation, i.e., focusing only on the accuracy of AU recognition and overlooking explanations of corresponding AU states. In this paper, we propose an end-to-end Vision-Language joint learning network for explainable FAU recognition (termed VL-FAU), which aims to reinforce AU representation capability and language interpretability through the integration of joint multimodal tasks. Specifically, VL-FAU brings together language models to generate fine-grained local muscle descriptions and a distinguishable global face description when optimising FAU recognition. Through this, the global facial representation and its local AU representations will achieve higher distinguishability among different AUs and different subjects. In addition, multi-level AU representation learning is utilised to improve AU individual attention-aware representation capabilities based on multi-scale combined facial stem features. Extensive experiments on DISFA and BP4D AU datasets show that the proposed approach achieves superior performance over the state-of-the-art methods on most of the metrics. In addition, compared with mainstream FAU recognition methods, VL-FAU can provide local- and global-level interpretable language descriptions with the AU predictions.
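
The abstract describes optimising AU recognition jointly with local and global description generation, but it does not give the loss. As a rough illustration only, the sketch below combines a multi-label AU loss with two language-modelling losses as a weighted sum. The function names, the weights `w_local` and `w_global`, and the choice of binary cross-entropy and token-level negative log-likelihood are all assumptions made for this example, not the formulation used in VL-FAU.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over independent AU predictions (multi-label)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def token_nll(token_probs, eps=1e-7):
    """Mean negative log-likelihood of the reference tokens of one description."""
    return sum(-math.log(max(p, eps)) for p in token_probs) / len(token_probs)


def joint_loss(au_probs, au_labels, local_token_probs, global_token_probs,
               w_local=1.0, w_global=1.0):
    """Hypothetical joint objective: AU loss plus weighted description losses.

    au_probs / au_labels: one predicted probability and one 0/1 label per AU.
    local_token_probs: one list per AU, holding the probability the language
        model assigned to each reference token of that AU's muscle description.
    global_token_probs: the same, for the single whole-face description.
    """
    l_au = binary_cross_entropy(au_probs, au_labels)
    l_local = sum(token_nll(t) for t in local_token_probs) / len(local_token_probs)
    l_global = token_nll(global_token_probs)
    return l_au + w_local * l_local + w_global * l_global
```

Setting `w_local` and `w_global` to zero reduces this to a recognition-only objective, which is the setting the abstract attributes to mainstream FAU models.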

arXiv information

Authors	Xuri Ge, Junchen Fu, Fuhai Chen, Shan An, Nicu Sebe, Joemon M. Jose
Published	2024-08-01 15:35:44+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
