Anticipating Object State Changes in Long Procedural Videos

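Summary

This work introduces (a) the new problem of anticipating object state changes in images and videos during procedural activities, (b) newly curated annotation data for object state change classification based on the Ego4D dataset, and (c) the first method for addressing this challenging problem.
Solutions to this new task have important implications for vision-based scene understanding, automated monitoring systems, and action planning.
The proposed framework predicts object state changes that will occur in the near future due to yet unseen human actions by integrating learned visual features, which represent recent visual information, with natural language (NLP) features, which represent past object state changes and actions.
Building on the extensive and challenging Ego4D dataset, which provides a large-scale collection of first-person videos across numerous interaction scenarios, the authors introduce an extension called Ego4D-OSCA that provides newly curated annotation data for the object state change anticipation (OSCA) task.
An extensive experimental evaluation demonstrates the effectiveness of the proposed method in predicting object state changes in dynamic scenarios.
The performance of the proposed approach also underscores the potential of integrating video and linguistic cues to improve the predictive performance of video understanding systems, and lays the groundwork for future research on the new task of object state change anticipation.
The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.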

Summary (original)

In this work, we introduce (a) the new problem of anticipating object state changes in images and videos during procedural activities, (b) new curated annotation data for object state change classification based on the Ego4D dataset, and (c) the first method for addressing this challenging problem. Solutions to this new task have important implications in vision-based scene understanding, automated monitoring systems, and action planning. The proposed novel framework predicts object state changes that will occur in the near future due to yet unseen human actions by integrating learned visual features that represent recent visual information with natural language (NLP) features that represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset which provides a large-scale collection of first-person perspective videos across numerous interaction scenarios, we introduce an extension noted Ego4D-OSCA that provides new curated annotation data for the object state change anticipation task (OSCA). An extensive experimental evaluation is presented demonstrating the proposed method’s efficacy in predicting object state changes in dynamic scenarios. The performance of the proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems and lays the groundwork for future research on the new task of object state change anticipation. The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.

arXiv information

Authors	Victoria Manousaki, Konstantinos Bacharidis, Filippos Gouidis, Konstantinos Papoutsakis, Dimitris Plexousakis, Antonis Argyros
Published	2024-12-02 11:16:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
