InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Summary

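We introduce InternLM-XComposer2, a state-of-the-art vision-language model that excels at free-form text-image composition and comprehension.
The model goes beyond conventional vision-language understanding: it composes interleaved text-image content from diverse inputs such as outlines, detailed textual specifications, and reference images, enabling highly customizable content creation.
InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters only to image tokens, preserving the integrity of pre-trained language knowledge and balancing precise vision understanding against text composition with literary talent.
Experimental results demonstrate the superiority of InternLM-XComposer2, based on InternLM2-7B, in producing high-quality long-text multimodal content, as well as its strong vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments.
This highlights its proficiency in multimodal understanding.
The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.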
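
The Partial LoRA idea in the summary, adding LoRA parameters for image tokens only, can be illustrated with a small sketch. This is not the authors' implementation: it assumes the standard LoRA form (a frozen base weight plus a low-rank product of two matrices), and the function names, shapes, and toy values are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def partial_lora_linear(tokens, is_image, w0, lora_a, lora_b):
    """Apply a frozen linear layer, adding a low-rank update only for image tokens.

    tokens:   list of input vectors, one per token
    is_image: list of bools, True where the token is an image token
    w0:       frozen base weight, shape (d_out, d_in)
    lora_a:   low-rank down-projection, shape (rank, d_in)  (assumed LoRA form)
    lora_b:   low-rank up-projection, shape (d_out, rank)   (assumed LoRA form)
    """
    outputs = []
    for x, image_token in zip(tokens, is_image):
        y = matvec(w0, x)  # base path, shared by every token
        if image_token:
            # Low-rank path, applied to image tokens only.
            delta = matvec(lora_b, matvec(lora_a, x))
            y = [base + d for base, d in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: 2-dimensional tokens and a rank-1 update (hypothetical values).
w0 = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]
lora_b = [[0.5], [0.0]]
tokens = [[1.0, 2.0], [1.0, 2.0]]  # same vector, once as image, once as text
is_image = [True, False]

outputs = partial_lora_linear(tokens, is_image, w0, lora_a, lora_b)
print(outputs)  # [[2.5, 2.0], [1.0, 2.0]]
```

The two input vectors are identical, so the difference in the outputs comes entirely from the image-token mask: the text token passes through the frozen weight unchanged, which is how the language path stays untouched in this sketch.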

Summary (original)

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.

arXiv information

Authors	Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
Published	2024-01-29 18:59:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
