Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Abstract

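We present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric that indicates the multi-modal pre-training quality of Large Vision-Language Models (LVLMs).
Large-scale pre-training plays a critical role in building capable LVLMs, but how to evaluate its quality without the costly supervised fine-tuning stage remains under-explored.
Loss, perplexity, and in-context evaluation results are commonly used as pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when a well-trained LLM is aligned with a new modality.
The lack of a proper metric greatly hinders LVLM research at the critical pre-training stage, including the choice of training data and the design of efficient modules.
In this paper, we propose evaluating pre-training quality from the perspective of inter-modal distribution distance and present MIR, which is:
1) Effective: it represents pre-training quality and shows a positive relation with benchmark performance after supervised fine-tuning.
2) Robust to different training and evaluation data.
3) Generalizable across training configurations and architecture choices.
We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results: MIR is indicative for training data selection, training strategy scheduling, and model architecture design aimed at better pre-training results.
We hope MIR will be a helpful metric for building capable LVLMs and will inspire follow-up research on modality alignment in other areas.
Our code is at https://github.com/shikiw/Modality-Integration-Rate.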
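
As a rough illustration of what an "inter-modal distribution distance" could look like, the sketch below fits a Gaussian to each set of token features and computes the squared Fréchet distance between the two fits. This is an assumption made purely for illustration: the abstract does not state which distance MIR uses, how features are extracted, or how layers are combined. The names `frechet_distance_sq`, `vision_feats`, and `text_feats` are hypothetical, and the features are random stand-in data rather than real model activations.

```python
import numpy as np


def frechet_distance_sq(x: np.ndarray, y: np.ndarray) -> float:
    """Squared Frechet distance between Gaussian fits of two feature sets.

    x, y: arrays of shape (num_tokens, feature_dim). Hypothetical helper,
    not taken from the MIR code base.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)

    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite inputs (up to floating-point noise).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


# Stand-in features: in practice these would be hidden states of vision
# tokens and text tokens taken from the same layer (an assumption here).
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(200, 4))
vision_feats = text_feats + 3.0  # same shape of distribution, shifted mean

gap = frechet_distance_sq(vision_feats, text_feats)
same = frechet_distance_sq(text_feats, text_feats)
print(f"shifted: {gap:.4f}, identical: {same:.4f}")
```

A smaller value means the two feature distributions are closer under this Gaussian approximation. Whether and how such a distance tracks downstream performance is what the paper itself evaluates.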

Abstract (original)

We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, while evaluating its training quality without the costly supervised fine-tuning stage is under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), while we observed that these metrics are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, the research of LVLMs in the critical pre-training stage is hindered greatly, including the training data choice, efficient module design, etc. In this paper, we propose evaluating the pre-training quality from the inter-modal distribution distance perspective and present MIR, the Modality Integration Rate, which is 1) \textbf{Effective} to represent the pre-training quality and show a positive relation with the benchmark performance after supervised fine-tuning. 2) \textbf{Robust} toward different training/evaluation data. 3) \textbf{Generalize} across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results that MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results. We hope MIR could be a helpful metric for building capable LVLMs and inspire the following research about modality alignment in different areas. Our code is at: https://github.com/shikiw/Modality-Integration-Rate.

arXiv information

Authors	Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
Published	2024-10-09 17:59:04+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
