VG4D: Vision-Language Model Goes 4D Video Recognition

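Summary

Understanding the real world through point cloud video is a crucial part of robotics and autonomous driving systems.
However, prevailing 4D point cloud recognition methods are limited by sensor resolution, so they lack detailed information.
Recent work has shown that vision-language models (VLMs) pre-trained on web-scale text-image datasets learn fine-grained visual concepts that transfer to a variety of downstream tasks.
However, integrating VLMs effectively into the 4D point cloud domain remains an open problem.
In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework, which transfers VLM knowledge from visual-text pre-trained models to a 4D point cloud network.
Our approach aligns the 4D encoder's representation with a VLM in order to learn a shared visual and text space from training on large-scale image-text pairs.
By transferring the VLM's knowledge to the 4D encoder and combining the 4D encoder with the VLM, VG4D improves recognition performance.
To strengthen the 4D encoder, we modernize the classic dynamic point cloud backbone and propose im-PSTNet, an improved version of PSTNet that models point cloud videos efficiently.
Experiments show that our method achieves state-of-the-art action recognition performance on both the NTU RGB+D 60 and NTU RGB+D 120 datasets.
Code is available at https://github.com/Shark0-0/VG4D.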
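
The alignment and combination steps described above can be illustrated with a small sketch. The abstract does not state the loss function, the temperature, or how the two models' outputs are combined, so everything below is an assumption made for illustration: a cross-entropy over cosine similarities between a point cloud feature and per-class text embeddings, and a plain average of the two models' class probabilities. The function names (`class_probs`, `alignment_loss`, `fused_prediction`) are hypothetical and do not come from the VG4D code base.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(feature, text_embeddings, temperature=0.07):
    """Class probabilities from cosine similarity to each class text embedding.

    The temperature value is an assumed placeholder, not taken from the paper.
    """
    f = l2_normalize(feature)
    logits = []
    for t in text_embeddings:
        t = l2_normalize(t)
        logits.append(sum(a * b for a, b in zip(f, t)) / temperature)
    return softmax(logits)


def alignment_loss(point_feature, text_embeddings, label):
    """Cross-entropy that pulls a 4D encoder feature toward its class text embedding."""
    return -math.log(class_probs(point_feature, text_embeddings)[label])


def fused_prediction(point_feature, vlm_feature, text_embeddings):
    """Average the two models' class probabilities (assumed fusion rule) and take the argmax."""
    p_4d = class_probs(point_feature, text_embeddings)
    p_vlm = class_probs(vlm_feature, text_embeddings)
    fused = [(a + b) / 2 for a, b in zip(p_4d, p_vlm)]
    return max(range(len(fused)), key=fused.__getitem__)


# Toy usage: two classes with orthogonal text embeddings.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = alignment_loss([0.9, 0.1], text_embeddings, label=0)
loss_wrong = alignment_loss([0.9, 0.1], text_embeddings, label=1)
predicted = fused_prediction([0.9, 0.1], [0.8, 0.2], text_embeddings)
```

In this toy setup the loss is lower when the label matches the nearest text embedding, which is the behavior an alignment objective of this kind is meant to encourage during training.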

Abstract (original)

Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLM into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach involves aligning the 4D encoder’s representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at https://github.com/Shark0-0/VG4D.

arXiv information

Authors	Zhichao Deng, Xiangtai Li, Xia Li, Yunhai Tong, Shen Zhao, Mengyuan Liu
Published	2024-04-17 17:54:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
