TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

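Summary

Large-scale video-language pre-training has significantly advanced video-language understanding tasks.
However, the heavy computational cost of video encoding remains a major efficiency bottleneck, particularly for long-form videos.
Because of their inherent 3D nature and spatiotemporal redundancy, these videos contain a very large number of visual tokens, which makes it difficult to capture complex temporal and spatial relationships.
To address this problem, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA).
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
TESTA can reduce the number of visual tokens by 75%, which speeds up video encoding.
Building on TESTA, we introduce a pre-trained video-language model with a divided space-time token aggregation module in each video encoder block.
We evaluate the model on five datasets covering paragraph-to-video retrieval and long-form VideoQA tasks.
Experimental results show that TESTA improves computing efficiency by 1.7 times and, because it scales to longer input frame sequences, achieves significant performance gains (e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie).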

Abstract (original)

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
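The aggregation idea described above can be illustrated with a small sketch. This is not the authors' implementation: the greedy cosine-similarity merge, the function names (`aggregate`, `condense`), and the choice to halve both the frame count and the per-frame patch count (1/2 x 1/2 = 1/4 of the tokens kept, i.e. a 75% reduction) are all assumptions made for illustration. The abstract states only that similar frames and similar patches are aggregated and that the token count drops by 75%.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def aggregate(tokens, keep, adjacent_only):
    """Greedily merge the most similar pair of tokens until `keep` remain.

    Hypothetical helper: merged tokens are size-weighted averages.
    adjacent_only=True restricts merging to neighbours (used for frames).
    """
    merged = [(list(t), 1) for t in tokens]  # (vector, originals averaged)
    while len(merged) > max(keep, 1):
        n = len(merged)
        if adjacent_only:
            pairs = [(i, i + 1) for i in range(n - 1)]
        else:
            pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
        i, j = max(pairs, key=lambda p: cosine(merged[p[0]][0], merged[p[1]][0]))
        (va, ca), (vb, cb) = merged[i], merged[j]
        vec = [(x * ca + y * cb) / (ca + cb) for x, y in zip(va, vb)]
        merged[i] = (vec, ca + cb)
        del merged[j]
    return [v for v, _ in merged]


def condense(video, frames_kept, patches_kept):
    """Divided aggregation: first across frames, then across patches per frame."""
    n_patches, dim = len(video[0]), len(video[0][0])
    # Temporal step: treat each frame as one flattened vector.
    flat = [[x for patch in frame for x in patch] for frame in video]
    flat = aggregate(flat, frames_kept, adjacent_only=True)
    frames = [[row[p * dim:(p + 1) * dim] for p in range(n_patches)] for row in flat]
    # Spatial step: merge similar patches inside each remaining frame.
    return [aggregate(frame, patches_kept, adjacent_only=False) for frame in frames]


# Toy input (made up): 4 frames x 4 patches x 2 feature dimensions.
scene_a = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
scene_b = [[0.5, 0.5], [0.6, 0.5], [1.0, -1.0], [1.0, -0.9]]
video = [[list(p) for p in scene] for scene in (scene_a, scene_a, scene_b, scene_b)]

condensed = condense(video, frames_kept=2, patches_kept=2)
tokens_in = sum(len(f) for f in video)
tokens_out = sum(len(f) for f in condensed)
print(tokens_in, tokens_out)  # 16 4
```

Keeping half the frames and half the patches per frame leaves a quarter of the original tokens, which matches the 75% figure; other splits between the temporal and spatial steps would give the same total reduction.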

arXiv information

Authors	Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou
Published	2023-10-29 16:25:32+00:00

Source and services used

arxiv.jp, Google
