Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age

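Summary

Technologies that recognize facial attributes such as race, gender, age, and emotion have several applications, including surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behavior.
Analyzing demographic characteristics and facial expressions from images poses several challenges because human facial attributes are complex.
Traditional approaches have employed CNNs and various other deep learning techniques trained on extensive collections of labeled images.
These methods have performed effectively, but there remains room for further improvement.
This paper proposes using vision language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images of human faces.
Various datasets, including FairFace, AffectNet, and UTKFace, are used to evaluate the solutions.
The results show that VLMs are competitive with, if not superior to, traditional techniques.
In addition, the paper proposes "FaceScanPaliGemma", a PaliGemma model fine-tuned for race, gender, age, and emotion recognition.
It achieves accuracies of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained version of PaliGemma, other VLMs, and SotA methods.
Finally, the paper proposes "FaceScanGPT", a GPT-4o model that recognizes the above attributes when several individuals are present in an image, using a prompt engineered for a person with specific facial and/or physical attributes.
The results underscore the superior multitasking capability of FaceScanGPT, which detects an individual's attributes such as haircut, clothing color, and posture using only a prompt to drive the detection and recognition tasks.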
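
The prompt-driven recognition described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the label sets, the JSON reply format, and the helper names `build_prompt`, `parse_reply`, and `accuracy` are all hypothetical, and the call to an actual VLM is deliberately left out.

```python
import json

# Hypothetical label sets for illustration only; the real categories
# depend on the dataset and are not listed in the abstract.
LABELS = {
    "gender": ["male", "female"],
    "emotion": ["neutral", "happy", "sad", "angry"],
}


def build_prompt(labels):
    """Build a text prompt asking for one label per attribute as JSON."""
    lines = ["Look at the face in the image and answer with a JSON object only."]
    for attribute, options in labels.items():
        lines.append(f'- "{attribute}": one of {", ".join(options)}')
    return "\n".join(lines)


def parse_reply(reply, labels):
    """Extract the JSON object from a model reply.

    Any attribute that is missing or not in the allowed options becomes None,
    so that free-form answers are counted as errors rather than guessed at.
    """
    empty = {attribute: None for attribute in labels}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return empty
    result = {}
    for attribute, options in labels.items():
        value = str(data.get(attribute, "")).strip().lower()
        result[attribute] = value if value in options else None
    return result


def accuracy(predictions, targets, attribute):
    """Fraction of samples whose predicted label matches the target label."""
    correct = sum(p[attribute] == t[attribute] for p, t in zip(predictions, targets))
    return correct / len(targets)
```

In use, the string from `build_prompt(LABELS)` would be sent to a VLM together with the image, and each reply would be passed through `parse_reply` before per-attribute accuracy is computed. Restricting answers to a closed label list is one simple way to make free-text model output comparable with classifier output.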

Summary (original)

Technologies for recognizing facial attributes like race, gender, age, and emotion have several applications, such as surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics based on images and analyzing facial expressions have several challenges due to the complexity of humans’ facial attributes. Traditional approaches have employed CNNs and various other deep learning techniques, trained on extensive collections of labeled images. While these methods demonstrated effective performance, there remains potential for further enhancements. In this paper, we propose to utilize vision language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images with human faces. Various datasets like FairFace, AffectNet, and UTKFace have been utilized to evaluate the solutions. The results show that VLMs are competitive if not superior to traditional techniques. Additionally, we propose ‘FaceScanPaliGemma’–a fine-tuned PaliGemma model–for race, gender, age, and emotion recognition. The results show an accuracy of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose ‘FaceScanGPT’, which is a GPT-4o model to recognize the above attributes when several individuals are present in the image using a prompt engineered for a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT to detect the individual’s attributes like hair cut, clothing color, postures, etc., using only a prompt to drive the detection and recognition tasks.

arXiv information

Authors	Nouar AlDahoul, Myles Joshua Toledo Tan, Harishwar Reddy Kasireddy, Yasir Zaki
Published	2024-10-31 17:09:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
