Deep Correlated Prompting for Visual Recognition with Missing Modalities

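Summary

Large-scale multimodal models perform well across a range of tasks, powered by large corpora of paired multimodal training data.
They are generally assumed to always receive modality-complete inputs.
This assumption does not always hold in the real world, where privacy constraints or collection difficulties can leave a modality missing. Models pretrained on modality-complete data readily show degraded performance in such missing-modality cases.
To address this, we turn to prompt learning to adapt large pretrained multimodal models to missing-modality scenarios, treating each missing case as a different type of input.
Rather than only prepending independent prompts to the intermediate layers, we propose exploiting the correlations between prompts and input features, and the relationships between prompts at different layers, to design the instructions carefully.
We also incorporate the complementary semantics of different modalities to guide the prompt design for each modality.
Extensive experiments on three commonly used datasets consistently show that our method outperforms previous approaches across different missing-modality scenarios.
Further ablations demonstrate the generalizability and reliability of the method across different modality-missing ratios and types.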
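
The idea of case-specific prompts whose deeper-layer values depend on the previous layer's prompt and on the input features can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no architectural details, so every name here (`CASES`, `init_prompts`, `next_layer_prompt`, `run`), the mean pooling, the `tanh` update, and the identity stand-in for a transformer layer are assumptions made purely for illustration. The cross-modal guidance mentioned in the abstract is omitted.

```python
import numpy as np

# Hypothetical missing-modality cases for an image-text model (an assumption).
CASES = ("complete", "missing_text", "missing_image")


def init_prompts(num_tokens, dim, rng):
    """Create one first-layer prompt per missing-modality case."""
    return {c: rng.normal(scale=0.02, size=(num_tokens, dim)) for c in CASES}


def next_layer_prompt(prev_prompt, features, w_prompt, w_feat):
    """Derive the next prompt from the previous prompt and pooled input features.

    Illustrative stand-in for correlated prompting: prompts are a function of
    the previous layer's prompt and the current features, not independent.
    """
    pooled = features.mean(axis=0, keepdims=True)  # shape (1, dim)
    return np.tanh(prev_prompt @ w_prompt + pooled @ w_feat)


def run(features, case, prompts, layer_weights, layer_fn=lambda tokens: tokens):
    """Prepend the case-specific prompt at each layer and update it layer by layer.

    layer_fn is a placeholder for a frozen transformer layer (identity here).
    """
    prompt, hidden = prompts[case], features
    n = prompt.shape[0]
    for w_prompt, w_feat in layer_weights:
        tokens = layer_fn(np.concatenate([prompt, hidden], axis=0))
        hidden = tokens[n:]  # keep only the feature tokens
        prompt = next_layer_prompt(prompt, hidden, w_prompt, w_feat)
    return np.concatenate([prompt, hidden], axis=0)


rng = np.random.default_rng(0)
prompts = init_prompts(num_tokens=4, dim=8, rng=rng)
weights = [(rng.normal(scale=0.3, size=(8, 8)), rng.normal(scale=0.3, size=(8, 8)))
           for _ in range(3)]
feats = rng.normal(size=(10, 8))
out_text_missing = run(feats, "missing_text", prompts, weights)
out_image_missing = run(feats, "missing_image", prompts, weights)
```

In this sketch only the prompts and the small projection matrices would be trained, while the pretrained backbone (here the identity `layer_fn`) stays fixed; that split is the usual motivation for prompt learning, though the exact trainable components of the method are not stated in the abstract.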

Abstract (original)

Large-scale multimodal models have shown excellent performance over a series of tasks powered by the large corpus of paired multimodal training data. Generally, they are always assumed to receive modality-complete inputs. However, this simple assumption may not always hold in the real world due to privacy constraints or collection difficulty, where models pretrained on modality-complete data easily demonstrate degraded performance on missing-modality cases. To handle this issue, we refer to prompt learning to adapt large pretrained multimodal models to handle missing-modality scenarios by regarding different missing cases as different types of input. Instead of only prepending independent prompts to the intermediate layers, we present to leverage the correlations between prompts and input features and excavate the relationships between different layers of prompts to carefully design the instructions. We also incorporate the complementary semantics of different modalities to guide the prompting design for each modality. Extensive experiments on three commonly-used datasets consistently demonstrate the superiority of our method compared to the previous approaches upon different missing scenarios. Plentiful ablations are further given to show the generalizability and reliability of our method upon different modality-missing ratios and types.

arXiv information

Authors	Lianyu Hu,Tongkai Shi,Wei Feng,Fanhua Shang,Liang Wan
Published	2024-10-21 14:11:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
