MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

Summary

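We present MobileVLM, a capable multimodal vision language model (MMVLM) designed to run on mobile devices.
It combines a range of mobile-oriented architectural designs and techniques: a set of language models at the 1.4B and 2.7B parameter scales, trained from scratch; a multimodal vision model pre-trained in the CLIP fashion; and cross-modality interaction via an efficient projector.
We evaluate MobileVLM on several typical VLM benchmarks.
Our models perform on par with a few much larger models.
More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art performance of 21.5 and 65.3 tokens per second, respectively.
Our code will be made available at https://github.com/Meituan-AutoML/MobileVLM.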

Abstract (original)

We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.

arXiv information

Authors	Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen
Published	2023-12-30 04:59:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
