Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

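Summary

This work introduces Vid2Seq, a multi-modal, single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
The Vid2Seq architecture augments a language model with special time tokens, so that it can predict event boundaries and textual descriptions in the same output sequence.
Such a unified model requires large-scale training data, which current annotated datasets do not provide.
The authors show that unlabeled narrated videos can be leveraged for dense video captioning by reformulating the sentence boundaries of transcribed speech as pseudo event boundaries and using the transcribed speech sentences as pseudo event captions.
The resulting Vid2Seq model, pretrained on the YT-Temporal-1B dataset, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions.
Vid2Seq also generalizes well to video paragraph captioning, video clip captioning, and few-shot settings.
The code is publicly available at https://antoyang.github.io/vid2seq.html.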
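
The idea of emitting event boundaries and captions in one output sequence can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the summary: the `<time=k>` token spelling, the choice of 100 bins, the quantization of timestamps relative to the video duration, and the function names `time_token` and `serialize_events`.

```python
from typing import List, Tuple

# (start_sec, end_sec, caption) for one annotated event
Event = Tuple[float, float, str]


def time_token(t: float, duration: float, num_bins: int = 100) -> str:
    """Map an absolute timestamp to a discrete time token (hypothetical format)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp to [0, 1] so out-of-range timestamps still map to a valid bin.
    frac = min(max(t / duration, 0.0), 1.0)
    # The last bin absorbs t == duration.
    idx = min(int(frac * num_bins), num_bins - 1)
    return f"<time={idx}>"


def serialize_events(events: List[Event], duration: float, num_bins: int = 100) -> str:
    """Interleave start/end time tokens with captions in a single target string."""
    parts = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        parts.append(time_token(start, duration, num_bins))
        parts.append(time_token(end, duration, num_bins))
        parts.append(caption.strip())
    return " ".join(parts)


if __name__ == "__main__":
    demo = [(12.5, 50.0, "Whisk the mixture"), (0.0, 12.5, "Crack the eggs")]
    print(serialize_events(demo, duration=50.0))
```

A sequence-to-sequence model trained on strings of this shape has to produce the boundaries and the text jointly, which is what makes a single-stage formulation possible.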

Abstract (original)

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html.
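
The pseudo-labeling step described in the abstract, where sentence boundaries of transcribed speech become pseudo event boundaries and the sentences become pseudo captions, can be sketched as follows. This is a simplified illustration under stated assumptions: the transcript is assumed to arrive as word-level `(start_sec, end_sec, token)` tuples with sentence-final punctuation already restored, and the names `words_to_pseudo_events` and `_flush` are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]   # (start_sec, end_sec, token), assumed ASR output
Event = Tuple[float, float, str]  # (start_sec, end_sec, pseudo caption)

SENTENCE_END = (".", "!", "?")


def _flush(words: List[Word]) -> Event:
    """Collapse the words of one sentence into a single pseudo event."""
    text = " ".join(token for _, _, token in words)
    return (words[0][0], words[-1][1], text)


def words_to_pseudo_events(words: List[Word]) -> List[Event]:
    """Group timestamped words into sentences and use each sentence as an event."""
    events: List[Event] = []
    current: List[Word] = []
    for word in words:
        current.append(word)
        # A token ending in terminal punctuation closes the current sentence.
        if word[2].endswith(SENTENCE_END):
            events.append(_flush(current))
            current = []
    if current:
        # Keep a trailing sentence that has no terminal punctuation.
        events.append(_flush(current))
    return events


if __name__ == "__main__":
    transcript = [
        (0.0, 0.4, "Chop"), (0.4, 0.9, "the"), (0.9, 1.5, "onions."),
        (3.0, 3.5, "Now"), (3.5, 4.2, "fry"),
    ]
    for event in words_to_pseudo_events(transcript):
        print(event)
```

No human annotation is involved in this sketch: the start of the first word and the end of the last word in each sentence stand in for event boundaries.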

arXiv information

Authors	Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
Published	2023-03-21 11:01:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
