Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

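Summary

Current methods for few-shot action recognition mostly fall within the metric-learning framework that follows ProtoNet.
However, they either ignore the effect of representative prototypes or fail to enhance prototypes adequately with multimodal information.
In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN), which comprises two modality flows and uses the semantic information of label texts as multimodal information to enhance prototypes.
A CLIP visual encoder is introduced in the visual flow, and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module.
A frozen CLIP text encoder is introduced in the text flow, and a semantic-enhanced module is used to enhance the text features.
After inflating, text prototypes are obtained.
The final multimodal prototypes are then computed by a multimodal prototype-enhanced module.
In addition, no evaluation metric exists for assessing the quality of prototypes.
To the best of our knowledge, we are the first to propose a prototype evaluation metric, Prototype Similarity Difference (PRIDE), which evaluates how well prototypes discriminate between different categories.
We conduct extensive experiments on four popular datasets.
MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics, and SSv2.
MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.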

Summary (original)

Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to enhance the prototypes with multimodal information adequately. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN) to use the semantic information of label texts as multimodal information to enhance prototypes, including two modality flows. A CLIP visual encoder is introduced in the visual flow, and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. A frozen CLIP text encoder is introduced in the text flow, and a semantic-enhanced module is used to enhance text features. After inflating, text prototypes are obtained. The final multimodal prototypes are then computed by a multimodal prototype-enhanced module. Besides, there exist no evaluation metrics to evaluate the quality of prototypes. To the best of our knowledge, we are the first to propose a prototype evaluation metric called Prototype Similarity Difference (PRIDE), which is used to evaluate the performance of prototypes in discriminating different categories. We conduct extensive experiments on four popular datasets. MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.
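
The prototype-based metric-learning idea mentioned in the abstract can be illustrated with a small sketch. This is not the MORN implementation: the abstract does not specify how the multimodal prototype-enhanced module combines the two flows, so the weighted average below is a hypothetical stand-in, and cosine similarity, the function names, the `weight` parameter, and the toy 2-D features are all assumptions made for illustration only.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def build_prototypes(support, text_features=None, weight=0.5):
    """Build one prototype per class from support features.

    support:       {label: [feature vectors]} (visual features)
    text_features: {label: feature vector} (label-text features), optional
    weight:        hypothetical mixing weight for the text features

    ASSUMPTION: the weighted average stands in for a learned fusion
    module; it is not the method described in the paper.
    """
    prototypes = {}
    for label, features in support.items():
        prototype = mean_vector(features)  # ProtoNet-style class mean
        if text_features is not None:
            text = text_features[label]
            prototype = [
                (1 - weight) * v + weight * t
                for v, t in zip(prototype, text)
            ]
        prototypes[label] = prototype
    return prototypes


def classify(query, prototypes):
    """Assign the query to the class whose prototype is most similar."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 2-shot episode with made-up 2-D features.
support = {
    "run": [[1.0, 0.0], [0.8, 0.2]],
    "jump": [[0.0, 1.0], [0.2, 0.8]],
}
text_features = {"run": [1.0, 0.1], "jump": [0.1, 1.0]}

prototypes = build_prototypes(support, text_features)
prediction = classify([0.9, 0.3], prototypes)
print(prediction)
```

In a real pipeline the feature vectors would come from trained encoders (the abstract names CLIP visual and text encoders) rather than hand-written lists, and the fusion step would be learned rather than a fixed average.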

arXiv information

Authors	Xinzhe Ni,Hao Wen,Yong Liu,Yatai Ji,Yujiu Yang
Published	2022-12-09 14:24:39+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
