HyperSeg: Towards Universal Visual Segmentation with Large Language Model

Summary

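This paper addresses universal segmentation for image and video perception, drawing on the strong reasoning ability of Visual Large Language Models (VLLMs).
Current unified segmentation methods have made significant progress, but they adapt poorly to both image and video scenarios and to complex reasoning segmentation. As a result, they struggle to handle a variety of challenging instructions and to accurately capture fine-grained vision-language correlations.
We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception. It covers generic segmentation tasks as well as more complex reasoning perception tasks that require strong reasoning abilities and world knowledge.
To make full use of the recognition capabilities of VLLMs and of fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks.
Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information.
Experimental results validate the effectiveness of these insights on universal image and video segmentation tasks, including the more complex reasoning perception tasks.
Our code is available.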

Abstract (original)

This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

arXiv information

Authors	Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang
Published	2024-11-26 17:18:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
