Multimodal Distillation for Egocentric Action Recognition

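Summary

Egocentric video understanding centers on modelling hand-object interactions.
Standard models, such as CNNs or Vision Transformers that take RGB frames as input, perform well.
Their performance improves further when they also use input modalities that provide complementary cues, such as object detections, optical flow, and audio. However, the added complexity of the modality-specific modules makes these models impractical to deploy.
The goal of this work is to retain the performance of such a multimodal approach while using only RGB frames as input at inference time.
For egocentric action recognition on the Epic-Kitchens and Something-Something datasets, the authors demonstrate that students taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground-truth labels in either a unimodal or a multimodal fashion.
They further adopt a principled multimodal knowledge distillation framework, which lets them deal with issues that arise when multimodal knowledge distillation is applied naively.
Finally, they demonstrate the achieved reduction in computational complexity and show that their approach maintains higher performance as the number of input views is reduced.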

Summary (original)

The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views.
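
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of a generic teacher-student setup, under stated assumptions: each modality-specific teacher (for example RGB, optical flow, audio) emits class logits, the teachers are combined by a plain average of their temperature-softened probabilities, and the RGB-only student is trained on a weighted sum of a KL term and a cross-entropy term. The function names, the `temperature` and `alpha` parameters, and the uniform averaging are illustrative choices and are not taken from the paper, whose framework may weight teachers differently.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(teacher_logits, temperature=1.0):
    """Average the softened predictions of several teachers.

    Assumption: uniform averaging; this is not necessarily the
    weighting scheme used by the authors.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [sum(p[i] for p in probs) / len(probs) for i in range(num_classes)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical loss: alpha * KL(teacher || student) + (1 - alpha) * CE."""
    teacher = ensemble_teacher_probs(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # The T^2 factor is the usual rescaling of the softened KL term.
    kd_term = kl_divergence(teacher, student_soft) * temperature ** 2
    # Cross-entropy against the ground-truth label at temperature 1.
    ce_term = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * kd_term + (1.0 - alpha) * ce_term


# Three hypothetical teachers (e.g. RGB, flow, audio) over three classes.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
agreeing = distillation_loss([2.0, 0.5, -1.0], teachers, label=0)
disagreeing = distillation_loss([-1.0, 0.5, 2.0], teachers, label=0)
print(agreeing < disagreeing)
```

At inference time only the student is evaluated, which is where the reduction in computational cost described in the abstract would come from: the teachers and their modality-specific inputs are needed during training only.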

arXiv information

Authors	Gorjan Radevski, Dusan Grujicic, Marie-Francine Moens, Matthew Blaschko, Tinne Tuytelaars
Published	2023-07-14 17:07:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
