Understanding Video Transformers via Universal Concept Discovery

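Summary

This paper studies the problem of concept-based interpretability of video transformer representations.
Specifically, we seek to explain the decision-making process of video transformers in terms of automatically discovered, high-level spatiotemporal concepts.
Previous work on concept-based interpretability has focused solely on image-level tasks.
By comparison, video models must handle an additional temporal dimension, which increases complexity and makes it challenging to identify concepts that change dynamically over time.
In this work, we address these challenges systematically by introducing the first Video Transformer Concept Discovery (VTCD) algorithm.
To this end, we propose an efficient approach for identifying units of video transformer representations (concepts) without supervision and ranking their importance to a model's output.
The resulting concepts are highly interpretable and reveal spatiotemporal reasoning mechanisms and object-centric representations in unstructured video models.
Running this analysis jointly over a diverse set of supervised and self-supervised representations shows that some of these mechanisms are common to video transformers.
Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation.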
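
The two steps named in the summary, discovering concepts without supervision and then ranking them by importance to the output, can be illustrated with a small toy sketch. This is not the VTCD algorithm: the abstract does not specify the clustering or ranking procedure, so plain k-means and zero-masking are used here as stand-ins, and the token features, the `score_fn` readout, and all function names are hypothetical.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Plain k-means over token features; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every token to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def rank_concepts(features, labels, score_fn):
    """Rank concepts by how much the score drops when their tokens are zeroed."""
    base = score_fn(features)
    drops = {}
    for j in np.unique(labels):
        masked = features.copy()
        masked[labels == j] = 0.0
        drops[int(j)] = float(base - score_fn(masked))
    order = sorted(drops, key=drops.get, reverse=True)
    return order, drops

# Hypothetical token features for one clip: 24 tokens with 8 channels,
# drawn from two well-separated groups (assumption, not real model output).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=0.1, size=(12, 8))
group_b = rng.normal(loc=-5.0, scale=0.1, size=(12, 8))
tokens = np.concatenate([group_a, group_b])

# Toy readout standing in for a model output (assumption).
w = np.ones(8)
score_fn = lambda x: float(np.maximum(x @ w, 0.0).mean())

labels, _ = kmeans(tokens, k=2)
order, drops = rank_concepts(tokens, labels, score_fn)
print("concepts ranked by importance:", order)
```

In this toy setup only the tokens of `group_a` contribute to the readout, so the cluster containing them is ranked first. A real analysis would replace `tokens` with intermediate features from a video transformer and `score_fn` with the model's actual output.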

Summary (original)

This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations (concepts) and for ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatiotemporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation.

arXiv information

Authors	Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov
Published	2024-04-10 15:19:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
