Learning Object State Changes in Videos: An Open-World Perspective

Summary

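Object state changes (OSCs) are critically important for video understanding. Humans can easily generalize their understanding of OSCs from familiar objects to unknown ones, but current approaches are limited to a closed vocabulary. To address this gap, we introduce a new open-world formulation of the video OSC problem. The goal is to temporally localize the three stages of an OSC (the object's initial state, its transitioning state, and its end state), whether or not the object was observed during training. To this end, we develop VidOSC, which (1) leverages text and vision-language models as supervisory signals, removing the need to label OSC training data by hand, and (2) abstracts fine-grained shared state representations from objects to strengthen generalization. We also present HowToChange, the first open-world benchmark for video OSC localization. Experimental results demonstrate the effectiveness of our approach in both traditional closed-world and open-world scenarios.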

Summary (original)

Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects, current approaches are confined to a closed vocabulary. Addressing this gap, we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC — the object’s initial state, its transitioning state, and its end state — whether or not the object has been observed during training. Towards this end, we develop VidOSC, a holistic learning approach that: (1) leverages text and vision-language models for supervisory signals to obviate manually labeling OSC training data, and (2) abstracts fine-grained shared state representations from objects to enhance generalization. Furthermore, we present HowToChange, the first open-world benchmark for video OSC localization, which offers an order of magnitude increase in the label space and annotation volume compared to the best existing benchmark. Experimental results demonstrate the efficacy of our approach, in both traditional closed-world and open-world scenarios.
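To make the localization goal concrete, here is a minimal sketch of how per-frame stage predictions could be turned into temporal segments. It is an illustration under stated assumptions, not the VidOSC method: the abstract names the three stages but does not specify a label format, so the label strings, the extra "background" label, the `fps` parameter, and the function name `localize_stages` are all hypothetical.

```python
from itertools import groupby

# Hypothetical label names: the abstract names the three OSC stages,
# but the exact label format is an assumption made for this sketch.
STAGES = ("initial", "transitioning", "end")


def localize_stages(frame_labels, fps=1.0):
    """Group per-frame labels into (stage, start_sec, end_sec) segments.

    Frames whose label is not one of STAGES (for example a hypothetical
    "background" label) are skipped but still advance the clock.
    """
    segments = []
    index = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label in STAGES:
            segments.append((label, index / fps, (index + length) / fps))
        index += length
    return segments


# Example: eight frames sampled at 1 frame per second.
labels = [
    "background",
    "initial", "initial",
    "transitioning", "transitioning", "transitioning",
    "end", "end",
]
print(localize_stages(labels))
# → [('initial', 1.0, 3.0), ('transitioning', 3.0, 6.0), ('end', 6.0, 8.0)]
```

Each segment is half-open: it starts at the first frame of a run and ends where the next run begins. How the per-frame labels are predicted, and how unseen objects are handled, is the subject of the paper and is not modeled here.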

arXiv information

Authors	Zihui Xue, Kumar Ashutosh, Kristen Grauman
Published	2024-04-03 16:57:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
