CoMemo: LVLMs Need Image Context with Image Memory

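Summary

Recent progress in large vision-language models (LVLMs) built on large language models (LLMs) has made aligning visual features with LLM representations the dominant paradigm.
However, the architectural designs inherited from LLMs introduce characteristics that are suboptimal for multimodal processing.
First, LVLMs show a bimodal distribution in attention allocation, so visual content in the middle of the context is progressively neglected as the context grows.
Second, conventional positional encoding schemes fail to preserve important 2D structural relationships when processing dynamic high-resolution images.
To address these limitations, we propose CoMemo, a dual-path architecture that combines a context image path with an image memory path for visual processing, effectively mitigating the neglect of visual information.
We also introduce RoPE-DHR, a new positional encoding mechanism that uses thumbnail-based positional aggregation to preserve 2D spatial awareness while mitigating remote decay in extended sequences.
Evaluations on seven benchmarks, covering long-context comprehension, multi-image reasoning, and visual question answering, show that CoMemo outperforms conventional LVLM architectures.
The project page is available at https://lalbj.github.io/projects/CoMemo/.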

Abstract (original)

Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo’s superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.
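
The abstract does not spell out how the thumbnail-based positional aggregation in RoPE-DHR works, so the following is only a rough sketch of one way such a scheme could look, not the authors' implementation. It assumes that the thumbnail's patches receive ordinary sequential position ids and that every high-resolution tile has the same patch grid as the thumbnail, so each tile patch can reuse the id of the thumbnail patch covering the same image region. The function name `thumbnail_position_ids` and all of its parameters are hypothetical.

```python
def thumbnail_position_ids(thumb_h, thumb_w, tile_rows, tile_cols, start=0):
    """Illustrative only: share thumbnail position ids with high-res patches.

    Assumptions (not taken from the paper):
      * the thumbnail is a thumb_h x thumb_w grid of patches;
      * the high-resolution image is split into tile_rows x tile_cols tiles,
        each with the same thumb_h x thumb_w patch grid as the thumbnail.
    """
    # Thumbnail patches get plain sequential ids in row-major order.
    thumb_ids = [
        [start + r * thumb_w + c for c in range(thumb_w)]
        for r in range(thumb_h)
    ]

    # Size of the full high-resolution patch grid.
    full_h = tile_rows * thumb_h
    full_w = tile_cols * thumb_w

    # A high-res patch at (y, x) lies inside the thumbnail patch at
    # (y // tile_rows, x // tile_cols), so it inherits that patch's id.
    tile_ids = [
        [thumb_ids[y // tile_rows][x // tile_cols] for x in range(full_w)]
        for y in range(full_h)
    ]
    return thumb_ids, tile_ids


if __name__ == "__main__":
    _, tile_ids = thumbnail_position_ids(2, 2, 2, 2)
    for row in tile_ids:
        print(row)
```

Under these assumptions, a 2x2 thumbnail with 2x2 tiles yields 16 high-resolution patches that span only four distinct position ids, and patches that are neighbors in the image keep equal or adjacent ids along each row. That is the intuition behind keeping 2D awareness while limiting how far apart positions grow in long sequences.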

arXiv information

Authors	Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
Published	2025-06-06 17:59:06+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
