LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition

Summary

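Title: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition
Summary:
– Previous methods for dynamic facial expression recognition (DFER) are mainly based on CNNs, whose local operations ignore long-range dependencies in videos; Transformer-based methods can achieve better performance.
– However, Transformer-based methods incur higher FLOPs and computational cost. To address this, the local-global spatio-temporal Transformer (LOGO-Former) is proposed.
– LOGO-Former captures discriminative features within each frame and models contextual relationships among frames while keeping complexity in balance.
– Based on the priors that facial muscles move locally and facial expressions change gradually, both the space attention and the time attention are restricted to a local window, which captures local interactions among feature tokens.
– Global attention is then performed by iteratively querying a token with features from each local window, which yields long-range information over the whole video sequence.
– In addition, a compact loss regularization term is proposed to further encourage the learned features to have minimum intra-class distance and maximum inter-class distance.
– Experiments on two in-the-wild dynamic facial expression datasets (DFEW and FERV39K) indicate that the method provides an effective way to exploit spatial and temporal dependencies for DFER.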
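
The local and global attention steps summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`attend`, `local_window_attention`, `iterative_global_query`, `attention_pairs`), the non-overlapping 1D windows, the absence of learned projections, and the residual update of the query token are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def local_window_attention(tokens, window):
    """Self-attention restricted to non-overlapping windows (assumed layout)."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attend(chunk, chunk, chunk))
    return out


def iterative_global_query(tokens, window, query):
    """One query token visits each window in turn (hypothetical residual update)."""
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        attended = attend([query], chunk, chunk)[0]
        query = [q + a for q, a in zip(query, attended)]
    return query


def attention_pairs(num_tokens, window=None):
    """Count query-key pairs for full attention or for windowed attention."""
    if window is None:
        return num_tokens * num_tokens
    full, rest = divmod(num_tokens, window)
    return full * window * window + rest * rest
```

In this sketch, full attention over n tokens scores n * n query-key pairs, while windows of size w score roughly n * w pairs; that gap is the kind of cost reduction that restricting attention to local windows aims for.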

Abstract (original)

Previous methods for dynamic facial expression recognition (DFER) in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos. Transformer-based methods for DFER can achieve better performances but result in higher FLOPs and computational costs. To solve these problems, the local-global spatio-temporal Transformer (LOGO-Former) is proposed to capture discriminative features within each frame and model contextual relationships among frames while balancing the complexity. Based on the priors that facial muscles move locally and facial expressions gradually change, we first restrict both the space attention and the time attention to a local window to capture local interactions among feature tokens. Furthermore, we perform the global attention by querying a token with features from each local window iteratively to obtain long-range information of the whole video sequence. In addition, we propose the compact loss regularization term to further encourage the learned features have the minimum intra-class distance and the maximum inter-class distance. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K) indicate that our method provides an effective way to make use of the spatial and temporal dependencies for DFER.
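
The abstract does not give the formula of the compact loss regularization term, so the following is only a generic sketch of a regularizer with the same stated goal: small intra-class distance and large inter-class distance. The centroid-based ratio, the function names (`class_centroids`, `compactness_penalty`), and the `eps` parameter are assumptions, not the paper's definition.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_centroids(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for d, v in enumerate(f):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def compactness_penalty(features, labels, eps=1e-8):
    """Ratio of mean intra-class spread to mean inter-centroid distance.

    Hypothetical stand-in for a compactness regularizer: lower values mean
    tighter classes that sit farther apart.
    """
    centroids = class_centroids(features, labels)
    intra = sum(sq_dist(f, centroids[y])
                for f, y in zip(features, labels)) / len(features)
    classes = sorted(centroids)
    pairs = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    if not pairs:
        # With a single class there is no inter-class term to divide by.
        return intra
    inter = sum(sq_dist(centroids[a], centroids[b])
                for a, b in pairs) / len(pairs)
    return intra / (inter + eps)
```

A term of this kind would typically be added to the classification loss with a small weight; the weighting is likewise not specified in the abstract.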

arXiv information

Authors	Fuyan Ma, Bin Sun, Shutao Li
Published	2023-05-05 07:53:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
