Learning State-Aware Visual Representations from Audible Interactions

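Abstract

We propose a self-supervised algorithm that learns representations from egocentric video data.
Recently, considerable effort has gone into capturing how people interact with their environment as they carry out everyday activities.
As a result, several large-scale egocentric datasets of interaction-rich multimodal data have emerged.
However, learning representations from video can be difficult.
First, given the uncurated nature of long-form continuous video, learning effective representations requires focusing on the moments when interactions occur.
Second, visual representations of everyday activities must be sensitive to changes in the state of the environment.
Yet currently successful multimodal learning frameworks encourage representations to be invariant over time.
To address these challenges, we leverage audio signals to identify moments of likely interaction, which are conducive to better learning.
We also propose a novel self-supervised objective that learns from audible state changes caused by interactions.
We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.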

Abstract (original)

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multi-modal learning frameworks encourage representation invariance over time. To address these challenges, we leverage audio signals to identify moments of likely interactions which are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.

arXiv information

Authors	Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta
Published	2022-09-27 17:57:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
