Facial Expression Recognition using Vanilla ViT backbones with MAE Pretraining

Abstract

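Humans usually convey emotions through facial expressions, whether voluntarily or involuntarily.
Automatically recognizing basic expressions (such as happiness, sadness, and neutral) from a facial image, i.e., facial expression recognition (FER), is extremely challenging and attracts considerable research interest.
Large-scale datasets and powerful inference models have been proposed to address the problem.
Although considerable progress has been made, most state-of-the-art methods, which employ convolutional neural networks (CNNs) or elaborately modified Vision Transformers (ViTs), depend heavily on upstream supervised pretraining.
Transformers are displacing CNNs as the dominant architecture in a growing number of computer vision tasks.
However, because they have fewer inductive biases than CNNs, they usually need much more training data.
To explore whether a vanilla ViT can achieve competitive accuracy without extra training samples from upstream tasks, we use a plain ViT with MAE pretraining to perform the FER task.
Specifically, we first pretrain the original ViT as a Masked Autoencoder (MAE) on a large facial expression dataset without expression labels.
Then, we fine-tune the ViT on popular facial expression datasets with expression labels.
The presented method is quite competitive, reaching 90.22% on RAF-DB and 61.73% on AffectNet, and can serve as a simple yet strong ViT-based baseline for FER studies.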
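
The label-free pretraining step relies on the MAE idea of hiding a random subset of image patches and training the model to reconstruct them. The snippet below is a minimal sketch of that patch-masking step only, not the authors' implementation. The 224-pixel input, 16-pixel patches, and 75% mask ratio are assumptions taken from common ViT and MAE defaults, since the abstract does not state them, and the function names `num_patches_for` and `random_masking` are hypothetical.

```python
import random


def num_patches_for(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_masking(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into visible and masked sets.

    In an MAE-style setup, only the visible patches are fed to the
    encoder; the masked ones are the reconstruction targets.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# Assumed settings (not given in the abstract): 224x224 input,
# 16x16 patches, 75% of patches masked.
n = num_patches_for(224, 16)
visible, masked = random_masking(n, 0.75, seed=0)
print(n, len(visible), len(masked))
```

Under these assumed settings the encoder sees only a quarter of the patches for each image, and no expression labels are involved at this stage; labels enter only in the later fine-tuning step.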

Abstract (original)

Humans usually convey emotions voluntarily or involuntarily by facial expressions. Automatically recognizing the basic expression (such as happiness, sadness, and neutral) from a facial image, i.e., facial expression recognition (FER), is extremely challenging and attracts much research interests. Large scale datasets and powerful inference models have been proposed to address the problem. Though considerable progress has been made, most of the state of the arts employing convolutional neural networks (CNNs) or elaborately modified Vision Transformers (ViTs) depend heavily on upstream supervised pretraining. Transformers are taking place the domination of CNNs in more and more computer vision tasks. But they usually need much more data to train, since they use less inductive biases compared with CNNs. To explore whether a vanilla ViT without extra training samples from upstream tasks is able to achieve competitive accuracy, we use a plain ViT with MAE pretraining to perform the FER task. Specifically, we first pretrain the original ViT as a Masked Autoencoder (MAE) on a large facial expression dataset without expression labels. Then, we fine-tune the ViT on popular facial expression datasets with expression labels. The presented method is quite competitive with 90.22% on RAF-DB, 61.73% on AffectNet and can serve as a simple yet strong ViT-based baseline for FER studies.

arXiv information

Authors	Jia Li, Ziyang Zhang
Published	2022-07-22 13:39:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
