OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment

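Summary

Video visual relation detection (VidVRD) aims to identify objects and the relations between them in videos; it is challenging because of dynamic content, high annotation cost, and the long-tailed distribution of relations.
Visual language models (VLMs) help in exploring open-vocabulary visual relation detection, but they often overlook the connections between different visual regions and their relations.
Moreover, directly applying VLMs to identify visual relations in videos is difficult because of the large gap between images and videos.
We therefore propose OpenVidVRD, a new open-vocabulary VidVRD framework that transfers the rich knowledge and strong capabilities of VLMs to the VidVRD task through prompt learning.
Specifically, a VLM is used to extract text representations from region captions generated automatically for the regions of a video.
Next, a spatiotemporal refiner module derives object-level relation representations for the video by integrating complementary cross-modal spatiotemporal information.
In addition, a prompt-driven strategy for aligning semantic spaces draws on the semantic understanding of VLMs to improve the overall generalization ability of OpenVidVRD.
Extensive experiments on the public VidVRD and VidOR datasets show that the proposed model outperforms existing methods.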

Summary (original)

The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs’ rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated region captions based on the video’s regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
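
The abstract does not spell out how relations are scored, so the following is only a generic sketch of open-vocabulary classification in a shared semantic space, not the authors' implementation. It assumes that a feature for a subject-object pair and a text embedding for each relation prompt already live in the same vector space, and it ranks relations by temperature-scaled cosine similarity. The function names, the toy three-dimensional vectors, the relation labels, and the temperature of 0.07 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(pair_feature, text_embeddings, temperature=0.07):
    """Score one subject-object pair feature against every relation prompt.

    text_embeddings maps a relation label to the (assumed precomputed)
    embedding of its text prompt. New labels can be added at inference
    time without retraining, which is what makes the setup open-vocabulary.
    """
    labels = list(text_embeddings)
    sims = [cosine(pair_feature, text_embeddings[label]) for label in labels]
    return dict(zip(labels, softmax(sims, temperature)))


# Toy embeddings standing in for the outputs of a real text encoder (hypothetical).
text_embeddings = {
    "ride": [0.9, 0.1, 0.0],
    "chase": [0.1, 0.9, 0.0],
    "sit next to": [0.0, 0.1, 0.9],
}
# Toy feature standing in for a refined subject-object representation (hypothetical).
pair_feature = [0.8, 0.2, 0.1]

probs = classify_relation(pair_feature, text_embeddings)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In a real system the vectors would come from a VLM's text encoder and from the video-side relation representation; only the final similarity step is shown here.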

arXiv information

Authors	Qi Liu, Weiying Xue, Yuxiao Wang, Zhenao Wei
Published	2025-03-12 14:13:17+00:00

Source and services used

arxiv.jp, Google
