MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification

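Summary

Whole slide pathology image classification is challenging because gigapixel image sizes and limited annotation labels hinder model generalization.
This paper introduces a prompt learning method that adapts large vision-language models to few-shot pathology classification.
First, we extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs.
Next, we use this model to extract visual features and text embeddings from few-shot annotations and fine-tune it with learnable prompt embeddings.
Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares the interactions of learnable prompts with individual image patches and with groups of patches.
This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions.
To further improve accuracy, we use an (unbalanced) optimal transport-based visual-text distance, which makes the model more robust by mitigating perturbations that may arise during data augmentation.
Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach.
It surpasses several of the latest competitors and consistently improves performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
We release our implementations and pre-trained models at this MGPATH.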
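
The abstract says the vision model is aligned with a text encoder through contrastive learning on image-text pairs, but gives no formula. As a rough illustration only, the sketch below implements the symmetric CLIP-style contrastive (InfoNCE) loss that is commonly used for this kind of alignment. The function names, the temperature value of 0.07, and the choice of this particular loss are assumptions for illustration, not details confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text features.

    image_feats[i] and text_feats[i] are assumed to describe the same sample.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

The loss is small when each image feature is most similar to its own text feature and large when the pairing is scrambled, which is the property that pulls the two encoders into a shared embedding space.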

Summary (original)

Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at this MGPATH.
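
To make the optimal transport-based visual-text distance more concrete, here is a minimal pure-Python sketch of an entropic (Sinkhorn) transport distance between a set of patch features and a set of prompt embeddings. It is not the authors' implementation: the cosine cost, the uniform marginals, the parameters `eps` and `rho`, and the function names are all assumptions. Passing `rho` switches to the common KL-relaxed (unbalanced) scaling update with exponent `rho / (rho + eps)`.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix with entries 1 - cosine similarity (assumed cost choice)."""
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    xs = [unit(v) for v in xs]
    ys = [unit(v) for v in ys]
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]


def sinkhorn_distance(cost, eps=0.1, rho=None, n_iters=200):
    """Entropic OT distance with uniform marginals.

    rho=None gives balanced Sinkhorn; a positive rho relaxes the marginal
    constraints with a KL penalty (unbalanced variant). Both eps and rho
    are hypothetical hyperparameters, not values from the paper.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on visual patches
    b = [1.0 / m] * m  # uniform mass on text prompts
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    power = 1.0 if rho is None else rho / (rho + eps)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [(a[i] / sum(kernel[i][j] * v[j] for j in range(m))) ** power
             for i in range(n)]
        v = [(b[j] / sum(kernel[i][j] * u[i] for i in range(n))) ** power
             for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan
```

A call such as `sinkhorn_distance(cosine_cost(patch_feats, prompt_feats), rho=1.0)` returns the transport cost and the transport plan. In the unbalanced setting the plan is not forced to move all of the mass, which is one plausible reading of how such a distance could tolerate patches perturbed by augmentation.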

arXiv information

Authors	Anh-Tien Nguyen,Duy Minh Ho Nguyen,Nghiem Tuong Diep,Trung Quoc Nguyen,Nhat Ho,Jacqueline Michelle Metsch,Miriam Cindy Maurer,Daniel Sonntag,Hanibal Bohnenberger,Anne-Christin Hauschild
Published	2025-05-13 15:09:47+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
