HD-EPIC: A Highly-Detailed Egocentric Video Dataset

Summary

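We present a validation dataset of newly collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations.
Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, and object locations, and are primed with gaze.
The footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in the wild whose detailed annotations match those from controlled lab environments.
We show the potential of these highly detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction.
The powerful long-context Gemini Pro achieves only 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs.
We additionally assess action recognition, sound recognition, and long-term video object segmentation on HD-EPIC.
HD-EPIC comprises 41 hours of video across 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements, and 37K object masks lifted to 3D.
On average, there are 263 annotations per minute of unscripted video.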
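
As a rough sanity check on the annotation density, the quoted rate of 263 annotations per minute can be multiplied out over the quoted 41 hours of video. The sketch below uses only those two figures from the abstract; the function and constant names are illustrative, and because both inputs are rounded, the result is an order-of-magnitude estimate rather than an official count.

```python
# Back-of-the-envelope estimate from two figures quoted in the abstract.
# Both inputs are rounded in the source, so the product is approximate.
VIDEO_HOURS = 41              # total video duration, as quoted
ANNOTATIONS_PER_MINUTE = 263  # average annotation density, as quoted


def implied_total_annotations(hours: int, per_minute: int) -> int:
    """Return the total annotation count implied by a duration and a rate."""
    return hours * 60 * per_minute


total = implied_total_annotations(VIDEO_HOURS, ANNOTATIONS_PER_MINUTE)
print(f"{total:,} annotations implied over {VIDEO_HOURS} hours")
```

Under these assumptions the implied total is roughly 650K annotations across the dataset.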

Summary (original)

We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.

arXiv information

Authors	Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen
Published	2025-02-06 15:25:05+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
