COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Summary

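Procedural video understanding is attracting attention in the vision and language community.
Deep learning-based video analysis requires large amounts of data.
As a result, existing work often uses web videos as training resources, which makes it difficult to query instructional content from raw video observations.
To address this problem, we propose a new dataset, COM Kitchens.
The dataset consists of unedited overhead-view videos captured with smartphones, in which participants prepared food based on given recipes.
Fixed-viewpoint video datasets often lack environmental diversity because of high camera setup costs.
We used modern wide-angle smartphone lenses to cover the cooking counter from sink to cooktop in an overhead view, capturing activity without in-person assistance.
With this setup, we collected a diverse dataset by distributing smartphones to participants.
Using this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV).
Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.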

Summary (original)

Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query instructional contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose the novel video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video captioning domain Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.

arXiv information

Authors	Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku
Published	2024-08-05 07:00:10+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
