ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

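Abstract

Aligning vision and language concepts at a finer granularity remains an important topic for multimodal large language models (MLLMs), particularly for tasks such as referring and grounding.
Existing methods, such as proxy encoding and geometry encoding, introduce additional syntax to encode spatial information, which places an extra burden on communication between the language and vision modules.
This work proposes ClawMachine, a new methodology that explicitly denotes each entity with a token collective: a group of visual tokens that jointly represent higher-level semantics.
A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces.
Our method unifies the prompts and answers of visual referential tasks without using any additional syntax.
By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner and shows strong potential with scaled-up pre-training data.
Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
It also shows the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs.
Our code is available at github.com/martian422/ClawMachine.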

Abstract (original)

Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, groups of visual tokens that collaboratively represent higher-level semantics. A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner, demonstrating great potential with scaled-up pre-training data. Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency. It also exhibits the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs. Our code is available at github.com/martian422/ClawMachine.
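
As a rough illustration of what a joint vision-language vocabulary and a token collective could look like, here is a toy Python sketch. It is not the authors' implementation: the vocabulary sizes, the offset scheme, and every function name are hypothetical assumptions made for illustration. It only shows how an entity might be written inline as a group of visual token IDs, with no extra coordinate or proxy syntax.

```python
# Toy illustration only. All sizes and names below are hypothetical
# assumptions, not values or APIs taken from the ClawMachine paper or code.
TEXT_VOCAB_SIZE = 32000       # assumed size of the text vocabulary
VISUAL_CODEBOOK_SIZE = 8192   # assumed size of the discrete visual codebook


def visual_to_joint(code):
    """Map a visual codebook index to an ID in the joint vocabulary."""
    if not 0 <= code < VISUAL_CODEBOOK_SIZE:
        raise ValueError(f"visual code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def is_visual(token_id):
    """Return True if a joint-vocabulary ID falls in the visual range."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE


def build_sequence(text_before, entity_codes, text_after):
    """Splice one entity's token collective into a text token sequence."""
    collective = [visual_to_joint(c) for c in entity_codes]
    return list(text_before) + collective + list(text_after)


def extract_collectives(sequence):
    """Recover each maximal run of visual tokens as visual codebook indices."""
    groups, current = [], []
    for token_id in sequence:
        if is_visual(token_id):
            current.append(token_id - TEXT_VOCAB_SIZE)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Usage: the same flat sequence format serves as a prompt (referring)
# or as a model output to be decoded (grounding).
seq = build_sequence([5, 17], [3, 40, 41], [9])
print(seq)
print(extract_collectives(seq))
```

Under these assumptions, referring and grounding share one sequence format, so a single auto-regressive decoder could in principle read or emit an entity's visual tokens the same way it handles text tokens.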

arXiv information

Authors	Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Qixiang Ye
Published	2025-01-23 14:50:47+00:00
arXiv site	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
