REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders

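Summary

We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations from point prompts.
Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but the segmentation step makes them computationally expensive.
REN bypasses this bottleneck with a lightweight module that generates region tokens directly, enabling 60x faster token generation with 35x less memory while also improving token quality.
It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values, producing region tokens that correspond to the prompted objects.
REN is trained with three popular encoders (DINO, DINOv2, and OpenCLIP), and the authors show that it can be extended to other encoders without dedicated training.
REN is evaluated on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster.
Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on the Visual Haystacks single-needle challenge.
Code and models are available at https://github.com/savya08/REN.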
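
The cross-attention step described in the summary (point prompts as queries, patch features as keys and values) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sinusoidal point embedding, the single-head attention, the residual update, the three-block depth, and the names `embed_points`, `cross_attention_block`, and `encode_regions` are all assumptions made for illustration. The weights are random and untrained, and random features stand in for the output of a patch-based encoder such as DINO.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed_points(points, dim):
    """Hypothetical sinusoidal embedding: (P, 2) points in [0, 1] -> (P, dim).

    dim must be divisible by 4 (sin and cos for each of x and y).
    """
    freqs = 2.0 ** np.arange(dim // 4)                       # (dim/4,)
    angles = 2.0 * np.pi * points[:, :, None] * freqs         # (P, 2, dim/4)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(points.shape[0], -1)                   # (P, dim)


def cross_attention_block(queries, patch_feats, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: prompts attend over patch features."""
    q = queries @ w_q                                         # (P, d)
    k = patch_feats @ w_k                                     # (N, d)
    v = patch_feats @ w_v                                     # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (P, N)
    # Residual update of the query tokens (an assumed design choice).
    return queries + (attn @ v) @ w_o, attn


def encode_regions(points, patch_feats, num_blocks=3, seed=0):
    """Turn point prompts into one region token per prompt."""
    rng = np.random.default_rng(seed)
    dim = patch_feats.shape[-1]
    tokens = embed_points(points, dim)
    attn = None
    for _ in range(num_blocks):
        # Random, untrained projection weights (illustration only).
        w_q, w_k, w_v, w_o = (
            rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4)
        )
        tokens, attn = cross_attention_block(
            tokens, patch_feats, w_q, w_k, w_v, w_o
        )
    return tokens, attn


# Stand-in for encoder output: a 14 x 14 grid of 64-dimensional patch features.
rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((14 * 14, 64))

# Three point prompts given as normalized (x, y) image coordinates.
points = np.array([[0.25, 0.30], [0.70, 0.55], [0.10, 0.90]])

region_tokens, attn = encode_regions(points, patch_feats)
print(region_tokens.shape)  # (3, 64)
```

Each prompt yields a single token regardless of how many patches the image has, which is the sense in which region tokens are more compact than the full grid of patch features. No segmentation mask is computed at any point in this sketch.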

Abstract (original)

We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks’ single-needle challenge. Code and models are available at: https://github.com/savya08/REN.

arXiv information

Authors	Savya Khosla, Sethuraman TV, Barnett Lee, Alexander Schwing, Derek Hoiem
Published	2025-05-23 17:59:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
