Kronecker Mask and Interpretive Prompts are Language-Action Video Learners

Abstract

Contrastive language-image pretraining (CLIP) has significantly advanced image-based vision learning. A pressing topic subsequently arises: how can we effectively adapt CLIP to the video domain? Recent studies have focused on adjusting either the textual or visual branch of CLIP for action recognition. However, we argue that adaptations of both branches are crucial. In this paper, we propose CLAVER: a Contrastive Language-Action Video Learner, designed to shift CLIP’s focus from the alignment of static visual objects and concrete nouns to the alignment of dynamic action behaviors and abstract verbs. Specifically, we introduce a novel Kronecker mask attention for temporal modeling. Our tailored Kronecker mask offers three benefits: 1) it expands the temporal receptive field for each token, 2) it serves as an effective spatiotemporal heterogeneity inductive bias, mitigating the issue of spatiotemporal homogenization, and 3) it can be seamlessly plugged into transformer-based models. Regarding the textual branch, we leverage large language models to generate diverse, sentence-level, and semantically rich interpretive prompts of actions, which shift the model’s focus towards verb comprehension. Extensive experiments on various benchmarks and learning scenarios demonstrate the superiority and generality of our approach.
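
The abstract does not spell out how the Kronecker mask is constructed. As a rough illustration only, and not the authors' definition, the sketch below shows how a Kronecker product can build structured attention masks over video tokens. It assumes T frames with N patch tokens each, flattened frame-major so that token (t, n) has index t*N + n. The helper names (`ones`, `eye`, `kron`, `temporal_mask`, `cross_frame_mask`) and the specific `cross_frame_mask` pattern are hypothetical choices made for this example.

```python
def ones(rows, cols):
    """All-ones matrix as a list of lists."""
    return [[1] * cols for _ in range(rows)]


def eye(n):
    """n x n identity matrix as a list of lists."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product: block (i, j) of the result equals a[i][j] * b."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def temporal_mask(num_frames, num_patches):
    """1 where a token may attend: same patch position, any frame.

    This is the familiar 'temporal-only' pattern, J_T (x) I_N.
    """
    return kron(ones(num_frames, num_frames), eye(num_patches))


def cross_frame_mask(num_frames, num_patches):
    """Hypothetical wider pattern: every patch in the OTHER frames, plus self.

    Built as (J_T - I_T) (x) J_N + I_{TN}. This is an assumption made for
    illustration, not a formula taken from the abstract.
    """
    off_diag = [
        [0 if i == j else 1 for j in range(num_frames)]
        for i in range(num_frames)
    ]
    blocks = kron(off_diag, ones(num_patches, num_patches))
    size = num_frames * num_patches
    return [
        [1 if i == j else blocks[i][j] for j in range(size)]
        for i in range(size)
    ]


if __name__ == "__main__":
    for row in cross_frame_mask(2, 2):
        print(row)
```

In the temporal-only mask each token sees one position per frame, whereas in the hypothetical cross-frame mask it sees every patch of every other frame, which is one way a mask could widen a token's temporal receptive field. Because such a mask is just a 0/1 matrix over token pairs, it could be applied to the attention logits of an existing transformer layer without changing the layer's weights.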

arXiv information

Authors	Jingyi Yang,Zitong Yu,Xiuming Ni,Jia He,Hui Li
Published	2025-02-10 03:28:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
