Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

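Abstract

Our understanding of video streams depicting human activities is naturally multifaceted. In just a moment, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and predict what will happen next. To give autonomous systems this kind of holistic perception, it is essential that they learn to correlate concepts, abstract knowledge across diverse tasks, and exploit task synergies when learning new skills. An important step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, which is essential for learning new skills efficiently. This paper introduces Hier-EgoPack, which advances EgoPack by enabling reasoning across diverse temporal granularities, extending its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning, equipped with a GNN layer specifically designed to address the challenges of multi-granularity reasoning. We evaluate our approach on multiple Ego4d benchmarks involving both clip-level and frame-level reasoning, and demonstrate how effectively our hierarchical unified architecture solves these diverse tasks simultaneously.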

Abstract (original)

Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4d benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

arXiv information

Authors	Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Tatiana Tommasi, Giuseppe Averta
Published	2025-02-04 17:03:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
