SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Summary

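We introduce SigLIP 2, a family of new multilingual vision-language encoders that builds on the success of the original SigLIP.
In this second iteration, we extend the original image-text training objective with several techniques that were previously developed independently, combining them into a unified recipe. These include captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation.
With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs).
Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks.
We also train variants that support multiple resolutions and preserve the input's native aspect ratio.
Finally, we train on a more diverse data mixture that includes de-biasing techniques, leading to better multilingual understanding and improved fairness.
To allow users to trade off inference cost against performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).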
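
The summary mentions variants that support multiple resolutions while keeping the input's native aspect ratio. As a rough illustration of what that involves, the sketch below picks a target size whose sides are multiples of a patch size and whose patch count stays under a budget. The function name `native_aspect_resize`, the patch size of 16, and the budget of 256 patches are assumptions made for this example; they are not taken from the abstract, and this is not necessarily the procedure the authors use.

```python
import math


def native_aspect_resize(height, width, patch_size=16, max_patches=256):
    """Return a (height, width) target that roughly keeps the aspect ratio.

    Both sides of the result are multiples of patch_size, and the number of
    patches (rows * cols) never exceeds max_patches. All defaults are
    hypothetical values chosen for illustration.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if patch_size <= 0 or max_patches <= 0:
        raise ValueError("patch_size and max_patches must be positive")

    # Uniform scale factor that maps the image area onto the patch budget.
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))

    # Round down to whole patches, keeping at least one patch per side.
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))

    # Clamp so extreme aspect ratios cannot exceed the budget.
    rows = min(rows, max_patches)
    cols = min(cols, max_patches // rows)

    return rows * patch_size, cols * patch_size


# Example: a 480x640 image mapped onto at most 256 patches of 16x16 pixels.
target_h, target_w = native_aspect_resize(480, 640)
```

Rounding down to whole patches means the aspect ratio is only approximately preserved; a larger patch budget reduces that distortion at the cost of a longer token sequence.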

Summary (original)

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe — this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
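
Zero-shot classification, one of the core capabilities listed above, can be sketched with nothing more than image and text embeddings. The example below is a minimal illustration, not the released models' API: the embeddings are toy vectors standing in for encoder outputs, and the `temperature` and `bias` values are hypothetical placeholders for the learned scalars that a SigLIP-style model applies before its sigmoid.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pair_score(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Sigmoid score for one image-text pair.

    temperature and bias are hypothetical stand-ins for learned parameters.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def zero_shot_classify(image_emb, label_embs, **kwargs):
    """Return (index of the best-matching label, list of per-label scores)."""
    scores = [pair_score(image_emb, t, **kwargs) for t in label_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings standing in for text-encoder outputs of three label prompts.
labels = ["cat", "dog", "car"]
label_embs = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]

# Toy image embedding that lies closest to the second label.
image_emb = [0.1, 0.9, 0.0, 0.0]
best, scores = zero_shot_classify(image_emb, label_embs)
predicted_label = labels[best]
```

Because each pair is scored independently with a sigmoid rather than a softmax over all labels, the scores do not sum to one; picking the highest-scoring label is one simple way to turn them into a classification.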

arXiv information

Authors	Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
Published	2025-02-20 18:08:29+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
