Recur, Attend or Convolve? On Whether Temporal Modeling Matters for Cross-Domain Robustness in Action Recognition


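To enable a lightweight and systematic assessment of the ability to capture temporal structure that is not revealed by single frames, we provide the Temporal Shape (TS) dataset, as well as modified domains of Diving48.
In domain shift experiments on Diving48, 3D CNNs and attention-based models are shown to exhibit more texture bias than convolutional-recurrent models.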


Most action recognition models today are highly parameterized and are evaluated on datasets with predominantly spatially distinct classes. It has also been shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape in still image recognition tasks. Taken together, this raises the suspicion that large video models partly learn spurious correlations rather than learning to track relevant shapes over time and infer generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence. In this article, we empirically study whether the choice of low-level temporal modeling has consequences for texture bias and cross-domain robustness. To enable a lightweight and systematic assessment of the ability to capture temporal structure that is not revealed by single frames, we provide the Temporal Shape (TS) dataset, as well as modified domains of Diving48 that allow for the investigation of texture bias in video models. We find that, across a variety of model sizes, convolutional-recurrent and attention-based models show better out-of-domain robustness on TS than 3D CNNs. In domain shift experiments on Diving48, the results indicate that 3D CNNs and attention-based models exhibit more texture bias than convolutional-recurrent models. Moreover, qualitative examples suggest that convolutional-recurrent models learn more correct class attributes from the diving data than the other two types of models at the same global validation performance.


Authors: Sofia Broomé, Ernest Pokropek, Boyu Li, Hedvig Kjellström
Published: 2022-07-15 10:47:06+00:00

Category: cs.CV