Refer to Anything with Vision-Language Prompts

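Summary

Recent image segmentation models can segment images into high-quality masks of visual entities, but they cannot provide comprehensive semantic understanding of complex queries grounded in both language and vision.
This limitation reduces their effectiveness in applications that need user-friendly interaction driven by vision-language prompts.
To bridge this gap, we introduce a new task, omnimodal referring expression segmentation (ORES).
In this task, a model produces a group of masks from an arbitrary prompt specified either by text alone or by text plus reference visual entities.
To address this new challenge, we propose a framework, "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interaction and comprehension through a mask-centric large multimodal model.
For training and benchmarking ORES models, we create the datasets MaskGroups-2M and MaskGroups-HQ, which contain diverse mask groups specified by text and reference entities.
Through extensive evaluation, we demonstrate superior performance of RAS on the new ORES task, as well as on the classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
Project page: https://Ref2Any.github.io.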

Abstract (original)

Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to ‘Refer to Any Segmentation Mask Group’ (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.
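To make the ORES input and output concrete, here is a minimal sketch of what a prompt and a returned mask group might look like. Everything in it is an assumption for illustration: the names `OresPrompt` and `select_mask_group`, the set-of-pixels mask representation, and the score threshold are hypothetical and are not taken from the paper or its code. The selection function is only a toy stand-in for a model, with scores supplied by hand.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]      # (row, col) of a foreground pixel
Mask = FrozenSet[Pixel]      # toy binary mask: the set of foreground pixels


@dataclass
class OresPrompt:
    """Hypothetical container for an ORES query (names are illustrative)."""
    text: str
    reference_masks: List[Mask] = field(default_factory=list)

    @property
    def is_text_only(self) -> bool:
        # Text-only prompts carry no reference visual entities.
        return len(self.reference_masks) == 0


def select_mask_group(candidates: List[Mask], scores: List[float],
                      threshold: float = 0.5) -> List[Mask]:
    """Toy stand-in for a model: keep candidates scored at or above threshold.

    In a real system the scores would come from a model; here they are
    supplied by hand. An empty list is a valid answer (nothing matches).
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return [m for m, s in zip(candidates, scores) if s >= threshold]


# Usage: three candidate masks, one reference entity, hand-made scores.
cat = frozenset({(0, 0), (0, 1)})
dog = frozenset({(2, 2)})
sofa = frozenset({(4, 0), (4, 1), (4, 2)})

prompt = OresPrompt(text="animals near this object", reference_masks=[sofa])
group = select_mask_group([cat, dog, sofa], scores=[0.9, 0.7, 0.1])
print(prompt.is_text_only, len(group))
```

The point of the sketch is the shape of the task: the prompt may or may not carry reference masks, and the answer is a group of zero or more masks rather than a single mask.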

arXiv information

Authors	Shengcao Cao, Zijun Wei, Jason Kuen, Kangning Liu, Lingzhi Zhang, Jiuxiang Gu, HyunJoon Jung, Liang-Yan Gui, Yu-Xiong Wang
Published	2025-06-05 17:59:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
