Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

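Abstract

Large multimodal models (LMMs) have shown promise in vision-language tasks, but they struggle with high-resolution input and detailed scene understanding.
To address these challenges, we introduce Monkey, which enhances LMM capabilities.
First, Monkey processes an input image by dividing it into uniform patches, each matching the size (e.g., 448×448) used in the original training of the well-trained vision encoder.
With an individual adapter for each patch, Monkey can handle resolutions up to 1344×896 pixels, enabling it to capture complex visual information in detail.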
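The patch division described here can be illustrated with a minimal sketch. It only computes where uniform 448×448 patches would fall on an image. It is not the authors' implementation: the function name `patch_boxes` is hypothetical, the sketch assumes the image dimensions are exact multiples of the patch size, and it does not cover resizing, the per-patch adapters, or the vision encoder.

```python
def patch_boxes(width, height, patch=448):
    """Return (left, top, right, bottom) boxes that tile the image row by row.

    Hypothetical helper for illustration only; assumes the image has already
    been resized so that both dimensions are multiples of `patch`.
    """
    if width % patch or height % patch:
        raise ValueError("this sketch assumes dimensions are multiples of the patch size")
    return [
        (left, top, left + patch, top + patch)
        for top in range(0, height, patch)    # rows, top to bottom
        for left in range(0, width, patch)    # columns, left to right
    ]


boxes = patch_boxes(1344, 896)
print(len(boxes))  # 6 patches: 3 columns x 2 rows
```

At the maximum resolution mentioned in the abstract, 1344×896, this yields six 448×448 patches. The box tuples follow the (left, upper, right, lower) convention, so they could be passed to an image library's crop routine.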
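Second, it employs a multi-level description generation method that enriches the context for scene-object associations.
This two-part strategy enables more effective learning from the generated data: higher resolution captures visuals in more detail, which in turn makes comprehensive descriptions more effective.
Extensive ablation results validate the effectiveness of our designs.
In addition, experiments on 18 datasets demonstrate that Monkey surpasses existing LMMs on many tasks, including image captioning and various visual question answering formats.
Notably, in qualitative tests focused on dense-text question answering, Monkey showed encouraging results compared with GPT4V.
Code is available at https://github.com/Yuliang-Liu/Monkey.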

Abstract (original)

Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. Addressing these challenges, we introduce Monkey to enhance LMM capabilities. Firstly, Monkey processes input images by dividing them into uniform patches, each matching the size (e.g., 448×448) used in the original training of the well-trained vision encoder. Equipped with individual adapter for each patch, Monkey can handle higher resolutions up to 1344×896 pixels, enabling the detailed capture of complex visual information. Secondly, it employs a multi-level description generation method, enriching the context for scene-object associations. This two-part strategy ensures more effective learning from generated data: the higher resolution allows for a more detailed capture of visuals, which in turn enhances the effectiveness of comprehensive descriptions. Extensive ablative results validate the effectiveness of our designs. Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats. Specially, in qualitative tests focused on dense text question answering, Monkey has exhibited encouraging results compared with GPT4V. Code is available at https://github.com/Yuliang-Liu/Monkey.

arXiv information

Authors	Zhang Li,Biao Yang,Qiang Liu,Zhiyin Ma,Shuo Zhang,Jingxu Yang,Yabo Sun,Yuliang Liu,Xiang Bai
Published	2024-08-26 06:57:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
