Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

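Summary

Vision model pre-training has recently shifted from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data.
Despite this progress, no existing pre-training method makes effective use of interleaved image-text data, which is widespread on the Internet.
Inspired by the recent success of compression learning in natural language processing, the authors propose Latent Compression Learning (LCL), a new vision model pre-training method for interleaved image-text data.
The method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model.
The training objective decomposes into two basic tasks: 1) contrastive learning between the visual representation and the preceding context, and 2) generation of the subsequent text conditioned on the visual representation.
The experiments show that the method matches the performance of CLIP on paired pre-training datasets (e.g., LAION) and can also use interleaved pre-training data (e.g., MMC4) to learn robust visual representations from scratch, which demonstrates the potential of vision model pre-training on interleaved image-text data.
Code is available at https://github.com/OpenGVLab/LCL.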
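
The two tasks named in the summary can be illustrated with a toy loss computation. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the InfoNCE form of the contrastive term, the `temperature` value, and the `gen_weight` balancing factor are all assumptions made for illustration. The actual formulation is in the linked repository.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(context_vecs, image_latents, temperature=0.1):
    """InfoNCE-style loss (assumed form): context i should match image latent i.

    The other image latents in the batch act as negatives.
    """
    total = 0.0
    for i, ctx in enumerate(context_vecs):
        scores = [dot(ctx, img) / temperature for img in image_latents]
        total -= log_softmax(scores)[i]
    return total / len(context_vecs)


def generation_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for the text that follows an image."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def combined_loss(context_vecs, image_latents, token_logits, target_ids,
                  gen_weight=1.0):
    """Sum of both terms; gen_weight is a hypothetical balancing factor."""
    return (contrastive_loss(context_vecs, image_latents)
            + gen_weight * generation_loss(token_logits, target_ids))


if __name__ == "__main__":
    # Toy batch: two preceding-context vectors and two image latents.
    contexts = [[1.0, 0.0], [0.0, 1.0]]
    latents = [[0.9, 0.1], [0.2, 0.8]]
    # Logits over a 3-token vocabulary for two subsequent text positions.
    logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]
    targets = [0, 2]
    print(combined_loss(contexts, latents, logits, targets))
```

In this sketch the contrastive term falls when each context vector scores highest against its own image latent, and the generation term falls when the logits put more mass on the correct next token.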

Abstract (original)

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representation from scratch, showcasing the potential of vision model pre-training with interleaved image-text data. Code is released at https://github.com/OpenGVLab/LCL.

arXiv information

Authors	Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai
Published	2024-12-20 17:24:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
