Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models

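Summary

Efficient multimodal large language models (EMLLMs) have advanced rapidly of late.
Incorporating chain-of-thought (CoT) reasoning and step-by-step self-evaluation has improved their performance.
However, their limited parameter counts often make it difficult for EMLLMs to use self-evaluation effectively during inference.
Key challenges include synthesizing evaluation data, determining how much of it to use, optimizing training and inference strategies, and selecting appropriate prompts.
To address these issues, we introduce Self-Evaluation Augmented Training (SEAT).
SEAT uses more powerful EMLLMs for CoT reasoning, data selection, and evaluation generation, and then trains EMLLMs on the synthesized data.
However, handling long prompts and maintaining the quality of CoT reasoning remain problematic.
We therefore propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which splits long prompts into shorter, task-specific cascaded prompts and reduces cost in resource-limited settings.
During data synthesis, we employ open-source 7B-parameter EMLLMs and annotate a small dataset using short prompts.
Experiments demonstrate that Cas-SEAT significantly boosts the self-evaluation abilities of EMLLMs, improving performance by 19.68%, 55.57%, and 46.79% on the MathVista, Math-V, and We-Math datasets, respectively.
In addition, our Cas-SEAT Dataset serves as a valuable resource for future research on enhancing EMLLM self-evaluation.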

Summary (original)

Efficient Multimodal Large Language Models (EMLLMs) have rapidly advanced recently. Incorporating Chain-of-Thought (CoT) reasoning and step-by-step self-evaluation has improved their performance. However, limited parameters often hinder EMLLMs from effectively using self-evaluation during inference. Key challenges include synthesizing evaluation data, determining its quantity, optimizing training and inference strategies, and selecting appropriate prompts. To address these issues, we introduce Self-Evaluation Augmented Training (SEAT). SEAT uses more powerful EMLLMs for CoT reasoning, data selection, and evaluation generation, then trains EMLLMs with the synthesized data. However, handling long prompts and maintaining CoT reasoning quality are problematic. Therefore, we propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which breaks down lengthy prompts into shorter, task-specific cascaded prompts and reduces costs for resource-limited settings. During data synthesis, we employ open-source 7B-parameter EMLLMs and annotate a small dataset with short prompts. Experiments demonstrate that Cas-SEAT significantly boosts EMLLMs’ self-evaluation abilities, improving performance by 19.68%, 55.57%, and 46.79% on the MathVista, Math-V, and We-Math datasets, respectively. Additionally, our Cas-SEAT Dataset serves as a valuable resource for future research in enhancing EMLLM self-evaluation.
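
The cascading idea in the abstract, splitting one long prompt into shorter, task-specific prompts whose outputs feed the next stage, can be sketched roughly as follows. This is only an illustration under stated assumptions: the three stage names, the prompt wording, `run_cascade`, and the `model` callable are hypothetical and are not the authors' prompts or pipeline, and image inputs are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage templates. The abstract does not give the actual prompts,
# so both the stage names and the wording here are assumptions.
STAGES: List[Tuple[str, str]] = [
    ("reason", "Question: {question}\nThink step by step and give an answer."),
    ("evaluate", "Question: {question}\nReasoning: {reason}\n"
                 "Check each step and state whether it is correct."),
    ("revise", "Question: {question}\nReasoning: {reason}\n"
               "Evaluation: {evaluate}\nGive the final answer."),
]


def run_cascade(question: str, model: Callable[[str], str]) -> Dict[str, str]:
    """Run the short prompts in order, feeding earlier outputs into later ones.

    `model` is any text-in, text-out callable standing in for an EMLLM.
    """
    context: Dict[str, str] = {"question": question}
    for name, template in STAGES:
        # Each template only refers to keys produced by earlier stages.
        prompt = template.format(**context)
        context[name] = model(prompt)
    return context
```

Each stage sees only a short prompt built from the question and the earlier outputs it needs, rather than one long instruction covering reasoning, evaluation, and revision at once. A real setup would replace `model` with a call to the chosen EMLLM.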

arXiv information

Authors	Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu
Published	2025-01-10 02:28:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
