Compressed Vision for Efficient Video Understanding

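Abstract

Experience and reasoning take place across many time scales: milliseconds, seconds, hours, and days. Most computer vision research, however, focuses on individual images or short videos lasting a few seconds, because handling longer videos requires more scalable approaches. This work proposes a framework that can process hour-long videos on hardware that can currently process second-long videos. We replace standard video compression such as JPEG with neural compression and show that compressed videos can be fed directly as input to regular video networks. Operating on compressed video improves efficiency at every level of the pipeline, including data transfer, speed, and memory, making it possible to train models faster and on longer videos. Processing compressed signals has a drawback, however: done naively, it rules out standard augmentation techniques. We address this by introducing a small network that applies transformations to the latent codes, corresponding to the augmentations commonly used in the original video space. We demonstrate that the compressed vision pipeline trains video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also run proof-of-concept experiments on new tasks defined over roughly hour-long videos at standard frame rates. Processing such long videos is impossible without a compressed representation.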

Abstract (original)

Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels (data transfer, speed and memory), making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representation.

arXiv information

Authors	Olivia Wiles,Joao Carreira,Iain Barr,Andrew Zisserman,Mateusz Malinowski
Published	2022-10-06 15:35:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
