Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

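Summary

[Title] Scalable and accurate self-supervised multimodal representation learning without aligned video and text data

[Summary]
– Scaling up weakly supervised datasets has been highly effective in the image-text domain and has contributed to most recent state-of-the-art computer vision and multimodal neural networks.
– The video-text data mining approach based on automatic speech recognition (ASR), used in HowTo100M, yields low-quality captions that often do not refer to the video content.
– Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text).
– This work shows that recent advances in image captioning make it possible to pre-train high-quality video models without any parallel video-text data.
– Several video captioning models, built on an OPT language model and a TimeSformer visual backbone, are pre-trained and then fine-tuned on several video captioning datasets.
– Image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
– Pre-training on both images and videos produces a significantly better network than pre-training on a single modality (+4 CIDER on MSR-VTT).
– The method is complementary to existing pre-training and data mining approaches and can be used in a variety of settings.
– Given the efficacy of the pseudolabeling method, the authors plan to publicly release the generated captions.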
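
The pseudolabeling idea in the summary above can be illustrated with a minimal Python sketch: an image captioner is applied to a frame of each video clip, and the resulting caption stands in for a video caption during pre-training. This is not the authors' pipeline. The summary does not say how frames are chosen or which captioner is used, so the middle-frame choice and every name here (`Clip`, `PseudoLabel`, `middle_frame`, `pseudolabel_clips`, `toy_captioner`) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

# Assumption: a frame is represented by a plain string so the sketch runs
# without any video or image library. A real pipeline would use image arrays.
Frame = str


@dataclass
class Clip:
    clip_id: str
    frames: Sequence[Frame]


@dataclass
class PseudoLabel:
    clip_id: str
    caption: str


def middle_frame(clip: Clip) -> Frame:
    """Pick the temporally central frame (an assumed selection strategy)."""
    if not clip.frames:
        raise ValueError(f"clip {clip.clip_id} has no frames")
    return clip.frames[len(clip.frames) // 2]


def pseudolabel_clips(
    clips: Iterable[Clip],
    captioner: Callable[[Frame], str],
) -> List[PseudoLabel]:
    """Caption one frame per clip and keep the non-empty captions."""
    labels: List[PseudoLabel] = []
    for clip in clips:
        if not clip.frames:
            continue  # skip clips that cannot be captioned
        caption = captioner(middle_frame(clip)).strip()
        if caption:
            labels.append(PseudoLabel(clip.clip_id, caption))
    return labels


def toy_captioner(frame: Frame) -> str:
    """Hypothetical stand-in for a real image captioning model."""
    return f"a photo of {frame}"


if __name__ == "__main__":
    demo = [Clip("v1", ["pan", "egg", "plate"]), Clip("v2", [])]
    for label in pseudolabel_clips(demo, toy_captioner):
        print(label.clip_id, "->", label.caption)
```

Passing the captioner as a plain callable keeps the sketch independent of any particular model. The resulting (clip, caption) pairs would then take the place of ASR captions as pre-training targets.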

Abstract (original)

Scaling up weakly-supervised datasets has shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. Currently popular video-text data mining approach via automatic speech recognition (ASR) used in HowTo100M provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDER on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.

arXiv information

Authors	Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna Rumshisky, Wael Hamza
Published	2023-04-04 19:11:05+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, OpenAI
