Open-Vocabulary Action Localization with Iterative Visual Prompting

Summary

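Video action localization aims to find when specific actions occur in a long video.
Existing learning-based approaches have been successful, but they require annotated videos, which carries a considerable labor cost.
This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs).
The challenge is that VLMs are neither designed to process long videos nor tailored for finding actions.
We overcome these problems by extending an iterative visual prompting technique.
Specifically, we sample video frames and create a concatenated image with frame index labels, so that a VLM can identify the frames most likely to correspond to the start and end of the action.
By iteratively narrowing the sampling window around the selected frames, the estimate gradually converges to more precise temporal boundaries.
We demonstrate that this technique yields reasonable performance, with results comparable to state-of-the-art zero-shot action localization.
These results support the use of VLMs as a practical tool for video understanding.
Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/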

Summary (original)

Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/
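The iterative window narrowing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `pick` callable is a hypothetical stand-in for the VLM query (which, per the abstract, would receive a concatenated image of index-labeled frames), and the parameter names and defaults (`n_frames`, `shrink`, `n_iters`) are assumptions chosen for illustration. The sketch localizes one boundary; running it once for the start and once for the end would give both.

```python
from typing import Callable, List

# Hypothetical stand-in for the VLM query: given the timestamps of the sampled
# frames, return the index of the frame that best matches the boundary.
Picker = Callable[[List[float]], int]


def sample_times(start: float, end: float, n: int) -> List[float]:
    """Return n evenly spaced timestamps covering [start, end]."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def localize_boundary(pick: Picker, video_len: float, n_frames: int = 9,
                      shrink: float = 0.5, n_iters: int = 5) -> float:
    """Estimate one temporal boundary by repeatedly narrowing the window.

    All default values are illustrative assumptions, not values from the paper.
    """
    lo, hi = 0.0, video_len
    estimate = (lo + hi) / 2
    for _ in range(n_iters):
        times = sample_times(lo, hi, n_frames)
        estimate = times[pick(times)]  # frame chosen by the (mock) VLM
        # Shrink the window and re-center it on the chosen frame,
        # clamping to the extent of the video.
        half = (hi - lo) * shrink / 2
        lo = max(0.0, estimate - half)
        hi = min(video_len, estimate + half)
    return estimate


def make_mock_picker(true_time: float) -> Picker:
    """Mock picker for demonstration: selects the sample nearest true_time."""
    def pick(times: List[float]) -> int:
        return min(range(len(times)), key=lambda i: abs(times[i] - true_time))
    return pick


if __name__ == "__main__":
    estimate = localize_boundary(make_mock_picker(37.3), video_len=120.0)
    print(f"estimated boundary: {estimate:.2f} s")
```

With a shrink factor of 0.5, the window width halves on each pass, so the spacing between sampled frames (and thus the attainable precision) improves geometrically with the number of iterations. In a real system the mock picker would be replaced by a call that renders the sampled frames into one labeled image and parses the index returned by the VLM.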

arXiv information

Authors	Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
Published	2025-04-07 10:55:13+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
