CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

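Summary

Title: CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Summary: We propose CLIP-Lite, an information-efficient method for visual representation learning that aligns image features with textual annotations. Compared with the previously proposed CLIP model, CLIP-Lite requires only one negative image-text pair for each positive image-text pair when optimizing its contrastive learning objective. This is achieved by using an information-efficient lower bound to maximize the mutual information between the two input modalities. As a result, CLIP-Lite can be trained with significantly less data and smaller batch sizes while outperforming CLIP at the same scale. CLIP-Lite was pretrained on the COCO-Captions dataset and evaluated on transfer learning to other datasets. It obtains an absolute gain of 14.0% mAP on Pascal VOC classification and a 22.1% gain in top-1 accuracy on ImageNet, and it performs comparably to or better than other, more complex text-supervised models. CLIP-Lite also outperforms CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite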

Abstract (original)

We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
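The abstract does not name the lower bound it uses. As a rough illustration of how a mutual-information bound can work with a single negative pair per positive pair, the sketch below assumes a Jensen-Shannon-style estimator and a plain dot product as the image-text score. Both choices, along with the function names `softplus` and `jsd_mi_lower_bound`, are assumptions made for illustration and are not taken from the CLIP-Lite implementation.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def jsd_mi_lower_bound(image_feats, text_feats, rng):
    """Jensen-Shannon-style MI estimate with one negative caption per image.

    Assumptions (hypothetical, for illustration only): features are
    (n, d) arrays with n >= 2, row i of each array forms a matching pair,
    and the score of a pair is the dot product of its features.
    """
    n = image_feats.shape[0]
    # Positive scores: each image with its own caption.
    pos = np.sum(image_feats * text_feats, axis=1)
    # One negative per image: shift captions by a random non-zero offset
    # so that no image is paired with its own caption.
    shift = rng.integers(1, n)
    neg = np.sum(image_feats * np.roll(text_feats, shift, axis=0), axis=1)
    # E_pos[-softplus(-score)] - E_neg[softplus(score)]
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))


rng = np.random.default_rng(0)
feats = 5.0 * np.eye(4)
aligned = jsd_mi_lower_bound(feats, feats, rng)
misaligned = jsd_mi_lower_bound(feats, np.roll(feats, 1, axis=0), rng)
print(aligned > misaligned)
```

Maximizing this quantity pushes matching pairs toward high scores and the sampled mismatched pair toward low scores. Each image is compared with one negative caption rather than with every other caption in the batch, which is the property the abstract highlights.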

arXiv information

Authors	Aman Shrivastava,Ramprasaath R. Selvaraju,Nikhil Naik,Vicente Ordonez
Published	2023-05-11 13:47:42+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
