Semantic Alignment of Unimodal Medical Text and Vision Representations

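Summary

General-purpose AI models, particularly those designed for text and vision, show impressive versatility across a wide range of deep-learning tasks.
However, they often underperform in specialised domains such as medical imaging, where domain-specific solutions or alternative knowledge-transfer approaches are typically required.
Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally.
Building on this insight, it has been shown that applying a simple transformation (at most affine), estimated from a subset of semantically corresponding samples known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities.
This paper explores how semantic alignment, that is, estimating transformations between anchors, can bridge general-purpose AI with specialised medical knowledge.
Using multiple public chest X-ray datasets, the authors demonstrate that model stitching across model architectures lets general models integrate domain-specific knowledge without additional training, improving performance on medical tasks.
They also introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities.
The results show that the method not only outperforms general multimodal models but also approaches the performance of fully trained, medical-specific multimodal solutions.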

Abstract (original)

General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation – at most affine – estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment – estimating transformations between anchors – can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.
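
The anchor-based alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the transformation is at most affine and is estimated from anchors, so the ordinary least-squares fit, the cosine-similarity decision rule, the function names `fit_affine` and `zero_shot_classify`, and the synthetic embeddings standing in for real vision and text encoders are all assumptions made for illustration.

```python
import numpy as np


def fit_affine(src_anchors, dst_anchors):
    """Least-squares affine map (W, b) with src_anchors @ W + b ~= dst_anchors."""
    n = src_anchors.shape[0]
    # Append a column of ones so the bias is estimated jointly with W.
    src_h = np.hstack([src_anchors, np.ones((n, 1))])
    sol, *_ = np.linalg.lstsq(src_h, dst_anchors, rcond=None)
    return sol[:-1], sol[-1]


def zero_shot_classify(image_emb, class_text_emb, W, b):
    """Map image embeddings into the text space, then pick the closest class prompt."""
    mapped = image_emb @ W + b
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    text = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    # Rows are images, columns are classes; entries are cosine similarities.
    return np.argmax(mapped @ text.T, axis=1)


# Synthetic stand-ins for encoder outputs (assumption: no real models are loaded).
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 6))  # hypothetical map: vision space (8-d) -> text space (6-d)
b_true = rng.normal(size=6)
image_anchors = rng.normal(size=(50, 8))
text_anchors = image_anchors @ W_true + b_true  # semantically paired anchors

# Estimate the transformation from the anchors only; no encoder is retrained.
W, b = fit_affine(image_anchors, text_anchors)

# Hypothetical class prompts embedded by the text encoder.
class_text_emb = rng.normal(size=(3, 6))
true_labels = np.array([0, 1, 2, 1])
# Build query images that land on their class prompt under the true map.
query_images = (class_text_emb[true_labels] - b_true) @ np.linalg.pinv(W_true)

pred = zero_shot_classify(query_images, class_text_emb, W, b)
print(pred)
```

With real encoders the anchor pairs would not be related by an exact affine map, so the fit would only be approximate and the number and choice of anchors would matter; the synthetic data here is noise-free purely to keep the example small.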

arXiv information

Authors	Maxime Di Folco, Emily Chan, Marta Hasny, Cosmin I. Bercea, Julia A. Schnabel
Published	2025-03-06 14:28:17+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
