SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection

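Summary

Despite progress in vision-language understanding, performing image segmentation inside multimodal architectures remains a fundamental challenge. Existing vision-language models rely mainly on backbone architectures or CLIP-based embedding learning, which limits their fine-grained spatial localization and their ability to act on what they localize. This paper introduces SJTU (Spatial Judgments in multimodal models, Towards Unified segmentation through coordinate detection), a framework that uses spatial coordinate understanding to connect vision-language interaction with precise segmentation, so that a target can be identified accurately from a natural language instruction. The framework integrates segmentation techniques with vision-language models on the basis of multimodal spatial reasoning: the model detects bounding boxes as normalized coordinates, and those coordinates are converted into usable segmentation outputs. In this way the work explores how multimodal spatial and language representations can be combined. The authors report accurate object segmentation and strong performance across several benchmark datasets. Results on the COCO 2017 dataset for general object detection and on the Pascal VOC dataset for semantic segmentation are presented as evidence of the framework's generalization ability.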

Abstract (original)

Despite advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in multimodal models – Towards Unified segmentation through coordinate detection, a novel framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework proposes a novel approach for integrating segmentation techniques with vision-language models based on multimodal spatial inference. By leveraging normalized coordinate detection for bounding boxes and translating it into actionable segmentation outputs, we explore the possibility of integrating multimodal spatial and language representations. Based on the proposed technical approach, the framework demonstrates superior performance on various benchmark datasets as well as accurate object segmentation. Results on the COCO 2017 dataset for general object detection and Pascal VOC datasets for semantic segmentation demonstrate the generalization capabilities of the framework.
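The abstract does not spell out how normalized bounding-box coordinates become segmentation outputs, so the following is only a minimal sketch of the coordinate step. It assumes boxes are given as (x_min, y_min, x_max, y_max) with values in [0, 1], and it uses a filled rectangle as a coarse stand-in for a mask. The function names `denormalize_box` and `box_to_mask` are hypothetical and do not come from the paper.

```python
def denormalize_box(box, width, height):
    """Map a normalized (x_min, y_min, x_max, y_max) box to pixel coordinates.

    Assumption: coordinates are fractions of image width and height in [0, 1].
    """
    def clamp(value):
        # Keep slightly out-of-range model outputs inside the image.
        return min(max(value, 0.0), 1.0)

    x_min, y_min, x_max, y_max = box
    left = round(clamp(x_min) * width)
    right = round(clamp(x_max) * width)
    top = round(clamp(y_min) * height)
    bottom = round(clamp(y_max) * height)
    # Reorder the corners in case the model emitted them swapped.
    return min(left, right), min(top, bottom), max(left, right), max(top, bottom)


def box_to_mask(pixel_box, width, height):
    """Build a coarse binary mask: 1 inside the box, 0 elsewhere.

    This rectangle is a placeholder; a real pipeline would refine it with
    a segmentation model rather than use the box itself as the mask.
    """
    left, top, right, bottom = pixel_box
    return [
        [1 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    pixel_box = denormalize_box((0.25, 0.5, 0.75, 1.0), width=8, height=4)
    for row in box_to_mask(pixel_box, width=8, height=4):
        print("".join(str(cell) for cell in row))
```

Working in normalized coordinates keeps the language model's output independent of image resolution; the conversion to pixels happens only at the point where a mask is needed.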

arXiv information

Authors	Joongwon Chae, Zhenyu Wang, Peiwu Qin
Published	2024-12-03 16:53:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
