InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Summary

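We introduce InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-context input and output. IXC-2.5 excels at a range of text-image comprehension and composition applications, reaching GPT-4V-level capability with only a 7B LLM backend. Trained on 24K interleaved image-text contexts, it extends seamlessly to 96K long contexts through RoPE extrapolation. This long-context capability lets IXC-2.5 perform well on tasks that require extensive input and output contexts. Compared with the previous 2.0 version, InternLM-XComposer-2.5 brings three major upgrades in vision-language comprehension: (1) ultra-high-resolution understanding, (2) fine-grained video understanding, and (3) multi-turn, multi-image dialogue. Beyond comprehension, IXC-2.5 uses additional LoRA parameters for text-image composition to support two further applications: (1) crafting webpages and (2) composing high-quality text-image articles. IXC-2.5 was evaluated on 28 benchmarks and outperformed existing open-source state-of-the-art models on 16 of them. It also surpasses or closely competes with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.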

Summary (original)

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

arXiv information

Authors	Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
Published	2024-07-03 17:59:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
