SmolVLM: Redefining small and efficient multimodal models

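Summary

Large vision-language models (VLMs) deliver exceptional performance but require significant computational resources, which limits their deployment on mobile and edge devices.
Smaller VLMs typically mirror the design choices of larger models, such as extensive image tokenization, which leads to inefficient GPU memory usage and limited practicality for on-device applications.
We introduce SmolVLM, a series of compact multimodal models engineered specifically for resource-efficient inference.
We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead.
Through this exploration, we identify key design choices that yield substantial performance gains on image and video tasks with a minimal memory footprint.
Our smallest model, SmolVLM-256M, uses less than 1GB of GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap.
Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs that consume twice the GPU memory.
SmolVLM models extend beyond static images and demonstrate robust video comprehension capabilities.
Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, enabling practical, energy-efficient deployments at significantly smaller scales.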

Summary (original)

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
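
As a rough sanity check on the size comparison in the abstract, the sketch below computes the parameter ratio between Idefics-80B and SmolVLM-256M from the counts in the model names. The helper names are hypothetical. The weights-only memory estimate assumes 16-bit (2-byte) parameters, which the abstract does not state, and it ignores activations and other runtime overhead.

```python
# Parameter counts read off the model names in the abstract.
IDEFICS_80B_PARAMS = 80e9
SMOLVLM_256M_PARAMS = 256e6


def size_ratio(larger_params: float, smaller_params: float) -> float:
    """Return how many times larger one model is than another."""
    return larger_params / smaller_params


def weights_only_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate weight storage in GB (10^9 bytes).

    ASSUMPTION: 2 bytes per parameter (16-bit weights); not stated in the
    abstract. Activations and framework overhead are not included.
    """
    return num_params * bytes_per_param / 1e9


ratio = size_ratio(IDEFICS_80B_PARAMS, SMOLVLM_256M_PARAMS)
small_weights_gb = weights_only_gb(SMOLVLM_256M_PARAMS)

print(f"Idefics-80B is about {ratio:.1f}x larger than SmolVLM-256M")
print(f"SmolVLM-256M weights alone (assumed 16-bit): {small_weights_gb:.3f} GB")
```

The ratio of roughly 312 is consistent with the abstract's "300-times larger" wording. The weights-only figure is only a lower bound on inference memory under the stated assumption, not a measurement.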

arXiv information

Authors	Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
Published	2025-04-07 17:58:57+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
