Classification Done Right for Vision-Language Pre-Training

Summary

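We introduce SuperClass, a very simple classification method for vision-language pre-training on image-text data.
Unlike its contrastive counterpart CLIP, which contrasts against a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, with no additional text filtering or selection.
Because no text encoding serves as a contrastive target, SuperClass needs no text encoder and, unlike CLIP, does not need to maintain a large batch size.
SuperClass demonstrates superior performance on a range of downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks.
We further explore the scaling behavior of SuperClass with respect to model size, training length, and data size, and report encouraging results and comparisons with CLIP.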
https://github.com/x-cls/superclass
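
The idea of using tokenized raw text as classification labels can be illustrated with a small sketch. This is not the authors' implementation: the summary does not state which loss is used, so the code assumes a softmax cross-entropy against a normalized multi-hot target over the subword vocabulary. The function names, the toy vocabulary size, and the token ids are all hypothetical.

```python
import math


def multi_hot_target(token_ids, vocab_size):
    """Turn a caption's subword token ids into a normalized multi-hot target.

    Assumption: every distinct token in the caption counts as one positive
    class, and the positives share the probability mass equally.
    """
    if not token_ids:
        raise ValueError("caption must contain at least one token")
    target = [0.0] * vocab_size
    for t in set(token_ids):
        target[t] = 1.0
    total = sum(target)
    return [v / total for v in target]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def classification_loss(logits, token_ids):
    """Cross-entropy between the image's vocabulary logits and the text target."""
    target = multi_hot_target(token_ids, len(logits))
    logp = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, logp))


# Hypothetical toy case: a 6-entry vocabulary and a caption tokenized to [1, 4, 4].
# The logits stand in for the output of a vision encoder with a linear head.
logits = [0.0, 2.0, 0.0, 0.0, 2.0, 0.0]
loss = classification_loss(logits, [1, 4, 4])
print(f"loss = {loss:.4f}")
```

In this sketch the loss for one image depends only on that image's own caption tokens, not on other samples in the batch, which is consistent with the statement that no text encoder or large batch of contrastive negatives is required.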

Summary (original)

We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP who contrast with a text encoder, SuperClass directly utilizes tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Due to the absence of the text encoding as contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrated superior performance on various downstream tasks, including classic computer vision benchmarks and vision language downstream tasks. We further explored the scaling behavior of SuperClass on model size, training length, or data size, and reported encouraging results and comparisons to CLIP. https://github.com/x-cls/superclass

arXiv information

Authors	Huang Zilong, Ye Qinghao, Kang Bingyi, Feng Jiashi, Fan Haoqi
Published	2024-11-05 18:58:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
