MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning

Summary

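Prior work on federated learning (FL) often suffers performance degradation caused by data heterogeneity across clients.
Recent multimodal large language models (MLLMs), such as GPT-4v and LLaVA, have demonstrated strong proficiency in multimodal tasks such as image captioning and multimodal question answering.
We introduce a new federated learning framework, Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs a powerful MLLM on the server side to address the challenges of heterogeneous and long-tailed data.
Thanks to the advanced cross-modality representation capabilities and extensive open-vocabulary prior knowledge of MLLMs, the framework can exploit both the large, previously underused body of open-source data available on websites and powerful server-side computational resources.
As a result, unlike prior methods, MLLM-LLaVA-FL improves performance without increasing the risk of privacy leakage or the computational burden on local devices.
The framework has three key stages.
First, global visual-text pretraining of the model is performed.
This pretraining uses the extensive open-source data available online, with the assistance of MLLMs.
Next, the pretrained model is distributed to the clients for local training.
Finally, once the locally trained models are sent back to the server, global alignment is carried out under the supervision of MLLMs to further improve performance.
Experimental evaluations on established benchmarks show that the framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.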
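
The three stages described in the summary can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the "model" is a plain vector of floats, training is gradient descent toward a data mean, aggregation is a FedAvg-style weighted average (the summary does not say how local models are combined), and the function names `sgd_toward_mean`, `weighted_average`, and `run_round` are hypothetical. The server-side steps on public data merely stand in for the MLLM-assisted pretraining and MLLM-supervised alignment.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def sgd_toward_mean(weights: Vector, data: Sequence[Vector],
                    lr: float, steps: int) -> Vector:
    """Toy training: gradient descent on 0.5 * ||w - mean(data)||^2."""
    dim = len(weights)
    mean = [sum(row[i] for row in data) / len(data) for i in range(dim)]
    w = list(weights)
    for _ in range(steps):
        # The gradient of the toy loss with respect to w is (w - mean).
        w = [wi - lr * (wi - mi) for wi, mi in zip(w, mean)]
    return w


def weighted_average(models: Sequence[Vector], sizes: Sequence[int]) -> Vector:
    """FedAvg-style aggregation (an assumption): weight by client sample count."""
    total = sum(sizes)
    dim = len(models[0])
    return [sum(m[i] * n for m, n in zip(models, sizes)) / total
            for i in range(dim)]


def run_round(public_data: Sequence[Vector],
              client_data: Sequence[Sequence[Vector]],
              lr: float = 0.5, steps: int = 20,
              align_lr: float = 0.1, align_steps: int = 1
              ) -> Tuple[Vector, Vector, Vector]:
    dim = len(public_data[0])
    # Stage 1: server-side pretraining on public data
    # (stand-in for MLLM-assisted global visual-text pretraining).
    pretrained = sgd_toward_mean([0.0] * dim, public_data, lr, steps)
    # Stage 2: each client trains locally, starting from the pretrained model.
    local_models = [sgd_toward_mean(pretrained, d, lr, steps)
                    for d in client_data]
    aggregated = weighted_average(local_models, [len(d) for d in client_data])
    # Stage 3: server-side alignment of the aggregated model
    # (stand-in for global alignment under MLLM supervision).
    aligned = sgd_toward_mean(aggregated, public_data, align_lr, align_steps)
    return pretrained, aggregated, aligned
```

Note that only model weights move between clients and server in this sketch; the client data never leaves `client_data`, and the extra computation in stages 1 and 3 happens on the server, which mirrors the privacy and device-load claims in the summary.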

Summary (original)

Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. In light of the recent advances in multimodal large language models (MLLMs), such as GPT-4v and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at the server end to address the heterogeneous and long-tailed challenges. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources. Hence, MLLM-LLaVA-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. Initially, we conduct global visual-text pretraining of the model. This pretraining is facilitated by utilizing the extensive open-source data available online, with the assistance of MLLMs. Subsequently, the pretrained model is distributed among various clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance the performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical scenarios with data heterogeneity and long-tail distribution across different clients in FL.

arXiv information

Authors	Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
Published	2024-12-02 10:18:38+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
