Scaling Law Hypothesis for Multimodal Model

Summary

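We propose a scaling law hypothesis for multimodal models that process text, audio, images, and video within a shared token and embedding space.
Our framework predicts model performance from modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems.
We investigate whether leveraging more training data across multiple modalities can reduce the size of a multimodal model, enabling efficient deployment on resource-constrained devices.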

Summary (original)

We propose a scaling law hypothesis for multimodal models processing text, audio, images, and video within a shared token and embedding space. Our framework predicts model performance based on modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems. We explore whether leveraging more training data in multiple modalities can reduce the size of the multimodal model, enabling efficient deployment on resource-constrained devices.

arXiv information

Authors	Qingyun Sun, Zhen Guo, PIN AI Team
Published	2024-11-11 18:32:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
