Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Abstract

Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must obtain structured understanding, such as the temporal segmentation of a demonstration into sequences of actions and skills, and generalize that understanding to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, which shows that generalizable procedural video understanding models remain a distant goal and underscores the need to develop new approaches to these tasks. Data, model, and code will be publicly released.

arXiv information

Authors	Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun
Published	2023-11-30 18:19:23+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
