EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

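Summary

Egocentric video-language pretraining is an important paradigm for advancing the learning of egocentric hand-object interactions (EgoHOI). Although it has achieved strong results on existing testbeds, these benchmarks focus on closed-set visual concepts or limited scenarios. Because the real world contains diverse EgoHOIs, we propose an open-vocabulary benchmark named EgoHOIBench, which reveals that current egocentric video-language models (EgoVLMs) perform worse on fine-grained concepts. We attribute this performance gap to insufficient fine-grained supervision in current methods and to their strong bias toward understanding objects rather than temporal dynamics. To address these issues, we introduce a new asymmetric contrastive objective for EgoHOI named EgoNCE++. For the video-to-text loss, we strengthen text supervision by generating negative captions, using the in-context learning of large language models to substitute HOI-related words. For the text-to-video loss, we propose an object-centric positive video sampling strategy that aggregates video representations sharing the same nouns. Our extensive experiments show that EgoNCE++ significantly improves open-vocabulary HOI recognition, multi-instance retrieval, and action recognition across various egocentric models, with gains of up to +26.55%. Our code is available at https://github.com/xuboshen/EgoNCEpp.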

Summary (original)

Egocentric video-language pretraining is a crucial paradigm to advance the learning of egocentric hand-object interactions (EgoHOI). Despite the great success on existing testbeds, these benchmarks focus more on closed-set visual concepts or limited scenarios. Due to the occurrence of diverse EgoHOIs in the real world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal the diminished performance of current egocentric video-language models (EgoVLM) on fine-grained concepts, indicating that these models still lack a full spectrum of egocentric understanding. We attribute this performance gap to insufficient fine-grained supervision and strong bias towards understanding objects rather than temporal dynamics in current methods. To tackle these issues, we introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++. For video-to-text loss, we enhance text supervision through the generation of negative captions by leveraging the in-context learning of large language models to perform HOI-related word substitution. For text-to-video loss, we propose an object-centric positive video sampling strategy that aggregates video representations by the same nouns. Our extensive experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks across various egocentric models, with improvements of up to +26.55%. Our code is available at https://github.com/xuboshen/EgoNCEpp.
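
The abstract describes the two loss directions only in words, so the following is a minimal sketch of how such an asymmetric contrastive objective could look, not the authors' implementation (see the linked repository for that). It assumes L2-normalised embeddings given as plain Python lists, a softmax temperature `tau`, a single noun label per clip, and a multi-positive InfoNCE form for the text-to-video side; the function names and all of these choices are hypothetical simplifications.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def video_to_text_loss(videos, texts, hard_negatives, tau=0.05):
    """Video-to-text InfoNCE with extra negative captions per clip.

    videos[i] and texts[i] form a matching pair. hard_negatives[i] is a
    list of embeddings of rewritten captions for clip i (assumed to come
    from an LLM that swapped HOI-related words); they enter only the
    denominator of clip i.
    """
    total = 0.0
    for i, v in enumerate(videos):
        logits = [dot(v, t) / tau for t in texts]
        logits += [dot(v, n) / tau for n in hard_negatives[i]]
        total += log_sum_exp(logits) - logits[i]
    return total / len(videos)


def text_to_video_loss(videos, texts, nouns, tau=0.05):
    """Text-to-video loss where every video sharing the caption's noun
    counts as a positive (assumed multi-positive form)."""
    total = 0.0
    for i, t in enumerate(texts):
        logits = [dot(t, v) / tau for v in videos]
        positives = [s for s, noun in zip(logits, nouns) if noun == nouns[i]]
        total += log_sum_exp(logits) - log_sum_exp(positives)
    return total / len(texts)


def asymmetric_loss(videos, texts, hard_negatives, nouns, tau=0.05):
    """Sum of the two directions; equal weighting is an assumption."""
    return (video_to_text_loss(videos, texts, hard_negatives, tau)
            + text_to_video_loss(videos, texts, nouns, tau))
```

The asymmetry is that the two directions are shaped differently: hard negative captions enlarge the denominator only on the video-to-text side, while noun-sharing videos enlarge the numerator only on the text-to-video side.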

arXiv information

Authors	Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin
Published	2024-06-03 07:29:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
