Accurate and Fast Compressed Video Captioning

Summary

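Existing video captioning methods usually sample frames from a decoded video first and then run the downstream steps (for example, feature extraction and/or captioning model training).
In this pipeline, manual frame sampling can miss key information in the video and thereby degrade performance.
In addition, redundant information in the sampled frames can make captioning inference inefficient.
To address this, we study video captioning from a different perspective, in the compressed domain, which offers several advantages over the existing pipeline.
1) Compared with raw images from the decoded video, a compressed video, which consists of I-frames, motion vectors, and residuals, is highly distinguishable, so the entire video can be used for learning, without manual sampling, through a specialized model design.
2) The captioning model is more efficient at inference because it processes smaller and less redundant information.
We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning, which enables learning from the compressed video.
We show that even with this simple design, our method achieves state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches.
Code is available at https://github.com/acherstyx/CoCap.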
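
For readers unfamiliar with the compressed-domain terms in the summary, the following toy sketch shows how an I-frame, motion vectors, and a residual relate to one another: a predicted frame is rebuilt by shifting blocks of a reference frame and adding a correction. It is an illustration only, not the CoCap implementation. The function names, the block size, the clamping at frame borders, and the sign convention of the motion vectors are all assumptions made for this example, and real codecs differ in many details.

```python
# Toy illustration (not the CoCap code): rebuild a predicted frame from a
# reference frame, per-block motion vectors, and a residual.
# All names and conventions here are hypothetical.

def motion_compensate(reference, motion_vectors, block_size):
    """Predict a frame by reading each pixel from `reference` at an offset.

    `motion_vectors[by][bx]` is a (dy, dx) offset shared by every pixel in
    block (by, bx). Assumed convention: the offset points from the current
    pixel to its source pixel in the reference frame.
    """
    height, width = len(reference), len(reference[0])
    predicted = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion_vectors[y // block_size][x // block_size]
            # Clamp source coordinates so they stay inside the frame.
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            predicted[y][x] = reference[src_y][src_x]
    return predicted


def reconstruct_frame(reference, motion_vectors, residual, block_size):
    """Add the residual (per-pixel correction) to the motion-compensated frame."""
    predicted = motion_compensate(reference, motion_vectors, block_size)
    return [
        [p + r for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(predicted, residual)
    ]


# A 4x4 "I-frame" with a bright 2x2 square in the middle.
i_frame = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
# One (dy, dx) vector per 2x2 block: every block reads one pixel to its left,
# so the content appears shifted one pixel to the right.
motion_vectors = [
    [(0, -1), (0, -1)],
    [(0, -1), (0, -1)],
]
# The residual corrects what motion alone cannot explain (one pixel here).
residual = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

p_frame = reconstruct_frame(i_frame, motion_vectors, residual, block_size=2)
for row in p_frame:
    print(row)
```

In this toy setting the motion vectors and the residual are mostly zeros or repeated values, which gives an intuition for why the compressed representation carries less redundant information than a stack of fully decoded frames.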

Abstract (original)

Existing video captioning approaches typically require to first sample video frames from a decoded video and then conduct a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency in the inference of video captioning. Addressing this, we study video captioning from a different perspective in compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video for captioning. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap.

arXiv information

Authors	Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang
Published	2023-09-22 13:43:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
