YOLOE: Real-Time Seeing Anything

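Summary

Object detection and segmentation are widely used in computer vision applications, but conventional models such as the YOLO series, although efficient and accurate, are limited to predefined categories, which hinders their adaptability in open scenarios.
Recent open-set methods overcome this by leveraging text prompts, visual cues, or a prompt-free paradigm, but they often trade off performance against efficiency because of high computational demands or deployment complexity.
In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single, highly efficient model, achieving real-time "seeing anything".
For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy.
It refines pretrained text embeddings through a re-parameterizable lightweight auxiliary network and strengthens visual-textual alignment with zero inference and transfer overhead.
For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE).
It uses decoupled semantic and activation branches to improve visual embeddings and accuracy with minimal complexity.
For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy.
It uses a built-in large vocabulary and a specialized embedding to identify all objects, avoiding a costly dependency on a language model.
Extensive experiments show YOLOE's exceptional zero-shot performance and transferability, with high inference efficiency and low training cost.
Notably, on LVIS, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP with 3$\times$ less training cost and a 1.4$\times$ inference speedup.
When transferred to COCO, YOLOE-v8-L gains 0.6 AP$^b$ and 0.4 AP$^m$ over the closed-set YOLOv8-L with nearly 4$\times$ less training time.
Code and models are available at https://github.com/THU-MIG/yoloe.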
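
The "zero inference overhead" claim for RepRTA can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical auxiliary network made of a single residual linear layer, and all function and variable names are invented for illustration. It shows only the general idea that, once the prompt set is fixed, a text-side refinement can be computed once and baked into ordinary classifier weights, so the auxiliary network never runs at inference time.

```python
# Illustrative sketch only; the layer shape and all names are assumptions,
# not taken from the YOLOE code base.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refine(text_emb, aux_weight):
    """Hypothetical auxiliary network: a residual linear layer on one text embedding."""
    delta = [dot(row, text_emb) for row in aux_weight]
    return [t + d for t, d in zip(text_emb, delta)]


def scores_with_aux(region_feats, text_embs, aux_weight):
    """Training-style path: run the auxiliary network, then compare regions to text."""
    refined = [refine(t, aux_weight) for t in text_embs]
    return [[dot(r, t) for t in refined] for r in region_feats]


def fold(text_embs, aux_weight):
    """One-off step for a fixed prompt set: bake the refinement into plain weights."""
    return [refine(t, aux_weight) for t in text_embs]


def scores_folded(region_feats, folded_weights):
    """Inference-style path: an ordinary linear classifier, no auxiliary network."""
    return [[dot(r, w) for w in folded_weights] for r in region_feats]


region_feats = [[1.0, 0.0], [0.5, 0.5]]   # two region features (made-up values)
text_embs = [[1.0, 2.0], [0.0, 1.0]]      # two prompt embeddings (made-up values)
aux_weight = [[0.5, 0.0], [0.0, -0.5]]    # auxiliary layer weights (made-up values)

train_scores = scores_with_aux(region_feats, text_embs, aux_weight)
folded_weights = fold(text_embs, aux_weight)
infer_scores = scores_folded(region_feats, folded_weights)
```

Both paths produce the same region-text scores, but the second one only needs the folded weights, which is why the refinement adds no cost once the prompts are known.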

Abstract (original)

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE’s exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
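
The prompt-free idea can also be sketched in a few lines. The abstract states only that LRPC uses a built-in vocabulary and a specialized embedding; the sketch below additionally assumes that "lazy" means an anchor is compared against the vocabulary only after the specialized embedding has flagged it as containing an object. That reading, the threshold, and every name in the code are assumptions made for illustration.

```python
# Illustrative sketch only; the filter-then-match design and all names are assumptions.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def lazy_label(anchor_feats, object_emb, vocab_embs, vocab_names, threshold):
    """Label each anchor, consulting the vocabulary only for likely objects."""
    labels = []
    for feat in anchor_feats:
        # Step 1: one cheap comparison against the specialized "object" embedding.
        if dot(feat, object_emb) < threshold:
            labels.append(None)  # skipped: never compared against the vocabulary
            continue
        # Step 2: only now pay for the comparison against every vocabulary entry.
        sims = [dot(feat, v) for v in vocab_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(vocab_names[best])
    return labels


anchor_feats = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]  # made-up anchor features
object_emb = [1.0, 1.0]                               # made-up specialized embedding
vocab_embs = [[1.0, 0.0], [0.0, 1.0]]                 # made-up vocabulary embeddings
vocab_names = ["cat", "dog"]

labels = lazy_label(anchor_feats, object_emb, vocab_embs, vocab_names, threshold=0.5)
print(labels)  # ['cat', 'dog', None]
```

With a vocabulary of thousands of names, skipping the second step for background anchors is where the saving would come from under this assumption.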

arXiv information

Authors	Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Published	2025-03-10 15:42:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
