Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion

Abstract

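Existing pedestrian attribute recognition (PAR) algorithms use a pre-trained CNN (e.g., ResNet) as the backbone network for visual feature learning. Because they make little use of the relations between pedestrian images and attribute labels, their results can be sub-optimal.
In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Specifically, the attribute phrases are first expanded into sentences, and the pre-trained vision-language model CLIP is then adopted as the backbone for embedding both the images and the attribute descriptions.
The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture long-range relations between pixels.
A multi-modal Transformer then fuses the two sets of features, and a feed-forward network predicts the attributes.
To optimize the network efficiently, we propose a region-aware prompt tuning technique that adjusts very few parameters (only the prompt vectors and classification heads) while keeping both the pre-trained VL model and the multi-modal Transformer frozen.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the fine-tuning strategy.
It also achieves new state-of-the-art performance in both the standard and zero-shot PAR settings, on the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS, and RAP-ZS datasets.
The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.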
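
The step of expanding attribute phrases into sentences can be illustrated with a minimal Python sketch. The template string and the helper names `expand_attribute` and `build_prompts` are assumptions made for illustration; the abstract does not state the wording the authors actually use.

```python
from __future__ import annotations

# Hypothetical template: the abstract only says that attribute phrases
# are "expanded into sentences", not which wording is used.
DEFAULT_TEMPLATE = "A pedestrian with the attribute: {}."


def expand_attribute(phrase: str, template: str = DEFAULT_TEMPLATE) -> str:
    """Turn a short attribute label into a full sentence for a text encoder."""
    # Normalise label-style names such as "Long_Hair" to "long hair".
    cleaned = " ".join(phrase.replace("_", " ").split()).lower()
    return template.format(cleaned)


def build_prompts(attributes: list[str]) -> list[str]:
    """Expand every attribute label of a dataset into one sentence each."""
    return [expand_attribute(a) for a in attributes]


if __name__ == "__main__":
    for sentence in build_prompts(["Long_Hair", "Backpack"]):
        print(sentence)
```

In a full pipeline, sentences like these would be passed to the CLIP text encoder, while the pedestrian image goes through the CLIP image encoder.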

Abstract (original)

Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including RAPv1, RAPv2, WIDER, PA100K, and PETA-ZS, RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.
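
To make the parameter budget of prompt tuning more concrete, the sketch below computes the fraction of parameters left trainable when the backbone and the fusion module are frozen. The module names and sizes are made-up toy values, not figures from the paper, and treating the reported 0.75% as "tuned parameters divided by all parameters" is only one plausible reading of the abstract.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    num_params: int
    trainable: bool  # False means the group is frozen during training


def trainable_fraction(groups):
    """Return tuned parameters divided by all parameters."""
    total = sum(g.num_params for g in groups)
    if total == 0:
        raise ValueError("no parameters to count")
    tuned = sum(g.num_params for g in groups if g.trainable)
    return tuned / total


# Toy sizes chosen for illustration only (assumption, not from the paper).
toy_groups = [
    ParamGroup("clip_backbone", 1_000_000, trainable=False),
    ParamGroup("multimodal_transformer", 200_000, trainable=False),
    ParamGroup("prompt_vectors", 5_000, trainable=True),
    ParamGroup("classification_heads", 7_000, trainable=True),
]

if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction(toy_groups):.2%}")
```

Under full fine-tuning every group would be marked trainable and the fraction would be 1.0, which is the baseline the prompt-tuning share is compared against.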

arXiv information

Authors	Xiao Wang, Jiandong Jin, Chenglong Li, Jin Tang, Cheng Zhang, Wei Wang
Published	2023-12-17 11:59:14+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
