Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Abstract

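Spatio-temporal human-object interaction (ST-HOI) understanding aims to detect HOIs in videos, which is crucial for activity understanding.
However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse: these benchmarks usually provide only a limited set of predefined object classes.
We therefore introduce a new open-world benchmark, Grounding Interacted Objects (GIO), which includes 1,098 interacted object classes and 290K interacted object box annotations.
Accordingly, we propose an object grounding task in which vision systems are expected to discover the interacted objects.
Although today's detectors and grounding methods have been highly successful, they perform unsatisfactorily when localizing the diverse and rare objects in GIO.
This reveals the limitations of current vision systems and poses a significant challenge.
We therefore explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects in diverse videos.
In extensive experiments, our method shows significant superiority over current baselines.
Data and code will be made publicly available at https://github.com/DirtyHarryLYL/HAKE-AVA.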

Abstract (original)

Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the truth that open-world objects are diverse, that is, they usually provide limited and predefined object classes. Therefore, we introduce a new open-world benchmark: Grounding Interacted Objects (GIO) including 1,098 interacted objects class and 290K interacted object boxes annotation. Accordingly, an object grounding task is proposed expecting vision systems to discover interacted objects. Even though today’s detectors and grounding methods have succeeded greatly, they perform unsatisfactorily in localizing diverse and rare objects in GIO. This profoundly reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. Our method demonstrates significant superiority in extensive experiments compared to current baselines. Data and code will be publicly available at https://github.com/DirtyHarryLYL/HAKE-AVA.

arXiv information

Authors	Xiaoyang Liu, Boran Wen, Xinpeng Liu, Zizheng Zhou, Hongwei Fan, Cewu Lu, Lizhuang Ma, Yulong Chen, Yong-Lu Li
Published	2024-12-27 09:08:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
