TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

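Summary

The ultimate goal of a foundation model is to be task-agnostic, that is, to support out-of-the-box use without task-specific fine-tuning.
Natural language processing and image representation learning have seen breakthroughs, but video models still struggle to reach this goal because spatiotemporal signals carry greater uncertainty.
To ease training, existing work leverages the prior knowledge of image foundation models and equips them with efficient temporal modules.
Although these models fine-tune well, we find empirically that they fall short of out-of-the-box use: under zero-shot and linear evaluation protocols they perform even worse than their baseline counterparts.
In this work, we analyze the cause of this degradation from the perspective of language supervision distortion.
We argue that tuning the text encoder end-to-end, as previous work does, is suboptimal: the encoder may overfit to a particular text style and lose its original ability to generalize across the semantics of different language registers.
The overfitted text encoder in turn provides a harmful supervision signal, which degrades the video representation.
To address this, we propose a degradation-free pre-training strategy that preserves the text encoder's generalization ability by freezing its shallow layers while letting the tunable deep layers capture task-related semantics.
For the training objective, we adopt the transcript sorting task from TVTS and combine it with masking techniques to make training scalable.
The result is a series of models, called TVTSv2, with up to one billion parameters.
With a frozen backbone, TVTSv2 sets a new state of the art on various video benchmarks, surpassing recent models such as ImageBind and InternVideo. Code is available at https://github.com/TencentARC/TVTS.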
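
The layer-freezing idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the `Layer` class, the `freeze_shallow_layers` helper, the 12-layer encoder, and the split point of 6 are all hypothetical and chosen only for illustration. The summary does not state how many layers are frozen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for one transformer block of a text encoder."""
    name: str
    trainable: bool = True


def freeze_shallow_layers(layers: List[Layer], num_frozen: int) -> List[Layer]:
    """Freeze the first `num_frozen` layers and leave the deeper ones tunable."""
    if not 0 <= num_frozen <= len(layers):
        raise ValueError("num_frozen must be between 0 and len(layers)")
    for index, layer in enumerate(layers):
        # Shallow layers (index < num_frozen) keep their pre-trained weights.
        layer.trainable = index >= num_frozen
    return layers


# Assumed 12-layer encoder; the split point of 6 is illustrative only.
text_encoder = [Layer(name=f"block_{i}") for i in range(12)]
freeze_shallow_layers(text_encoder, num_frozen=6)
tunable_names = [layer.name for layer in text_encoder if layer.trainable]
print(tunable_names)
```

In a framework such as PyTorch, the same effect is usually obtained by setting `requires_grad = False` on the parameters of the layers to be frozen, so that the optimizer updates only the deeper layers.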

Summary (original)

The ultimate goal for foundation models is realizing task-agnostic, i.e., supporting out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is still challenging for video models to reach it due to the increasing uncertainty of spatiotemporal signals. To ease training, existing works leverage image foundation models’ prior knowledge and equip them with efficient temporal modules. Despite the satisfactory fine-tuning performance, we empirically find they fall short of out-of-the-box usage, given the even degraded performance in zero-shot/linear protocols compared to their baseline counterparts. In this work, we analyze the factor that leads to degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal, degrading the video representation. To tackle this issue, we propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder via freezing shallow layers while enabling the task-related semantics capturing in tunable deep layers. As for the training objective, we adopted the transcript sorting task in TVTS incorporated with masking techniques to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-arts on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at https://github.com/TencentARC/TVTS.

arXiv information

Authors	Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
Published	2023-05-23 15:44:56+00:00

Source and services used

arxiv.jp, Google
