Multimodal Federated Learning via Contrastive Representation Ensemble

Abstract

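As the volume of multimedia data on modern mobile systems and IoT infrastructures grows, harnessing this rich multimodal data without breaching user privacy becomes a critical issue.
Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning.
However, existing FL methods extended to multimodal data all rely on model aggregation at the single-modality level, which constrains the server and clients to use an identical model architecture for each modality.
This limits the global model in both model complexity and data capacity, not to mention task diversity.
In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities while communicating only knowledge on a public dataset.
To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy for aggregating client representations.
To mitigate local model drift caused by two unprecedented heterogeneous factors stemming from multimodal discrepancy (the modality gap and the task gap), we further propose inter-modal and intra-modal contrasts to regularize local training.
These contrasts complement the information of the absent modality for uni-modal clients and regularize local clients to head toward global consensus.
Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks showcase the superiority of CreamFL over state-of-the-art FL methods, as well as its practical value.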
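
The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names: aggregating client representations computed on a shared public sample, and a contrastive loss that pulls a local representation toward a matching target. Every function name here is hypothetical, and the weighting rule (softmax over cosine similarity to the server's own representation) and the InfoNCE-style loss are assumptions made for illustration, not CreamFL's actual definitions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_representations(client_reps, global_rep, temperature=0.1):
    """Fuse client representations of ONE public sample into one vector.

    Assumption: each client is weighted by the softmax of its cosine
    similarity to the server's representation of the same sample.
    Only representations are exchanged, never model weights.
    """
    reps = [l2_normalize(r) for r in client_reps]
    anchor = l2_normalize(global_rep)
    weights = softmax([dot(r, anchor) / temperature for r in reps])
    fused = [sum(w * r[i] for w, r in zip(weights, reps))
             for i in range(len(anchor))]
    return l2_normalize(fused), weights


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when anchor is close to positive and
    far from every negative (all compared by cosine similarity)."""
    a = l2_normalize(anchor)
    logits = [dot(a, l2_normalize(positive)) / temperature]
    logits += [dot(a, l2_normalize(n)) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


if __name__ == "__main__":
    fused, weights = aggregate_representations(
        client_reps=[[1.0, 0.0], [0.0, 1.0]], global_rep=[1.0, 0.0])
    print("weights:", weights)
    print("fused:", fused)
    print("loss:", contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this sketch, the same loss could serve an inter-modal contrast (for example, a local image representation as anchor and a global text representation of the paired caption as positive) or an intra-modal one (local and global representations from the same modality). Which pairs the paper actually contrasts is not stated in the abstract.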

Abstract (original)

With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy becomes a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning. However, existing FL methods extended to multimodal data all rely on model aggregation on single modality level, which restrains the server and clients to have identical model architecture for each modality. This limits the global model in terms of both model complexity and data capacity, not to mention task diversity. In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while only communicating knowledge on public dataset. To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy to aggregate client representations. To mitigate local model drift caused by two unprecedented heterogeneous factors stemming from multimodal discrepancy (modality gap and task gap), we further propose two inter-modal and intra-modal contrasts to regularize local training, which complements information of the absent modality for uni-modal clients and regularizes local clients to head towards global consensus. Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks showcase the superiority of CreamFL over state-of-the-art FL methods and its practical value.

arXiv information

Authors	Qiying Yu, Yang Liu, Yimu Wang, Ke Xu, Jingjing Liu
Published	2023-02-17 14:17:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
