ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification

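Summary

Multiple instance learning (MIL)-based frameworks have become the mainstream approach in digital pathology for processing whole slide images (WSIs), which have gigapixel size and hierarchical image context.
However, these methods depend heavily on a substantial number of bag-level labels and learn only from the original slides, which are easily affected by variations in data distribution.
Recently, vision language model (VLM)-based methods have introduced a language prior by pre-training on large-scale pathological image-text pairs.
However, earlier text prompts do not take pathological prior knowledge into account, so they do not substantially improve model performance.
Moreover, collecting such pairs and running the pre-training process are very time-consuming and resource-intensive. To address these problems, the authors propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
Specifically, to boost VLM performance effectively, they propose a dual-scale visual descriptive text prompt generated with a frozen large language model (LLM).
To transfer the VLM to WSI processing efficiently, the image branch uses a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype.
The text branch uses a context-guided text decoder that enhances the text features by incorporating multi-granular image contexts.
Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.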

Summary (original)

Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides, which are easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompt lacks the consideration of pathological prior knowledge, therefore does not substantially boost the model’s performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming and source-intensive. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on the frozen large language model (LLM) to boost the performance of VLM effectively. To transfer the VLM to process WSI efficiently, for the image branch, we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
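
The abstract describes the image branch only at a high level: patch features are aggregated progressively by grouping similar patches into the same prototype. The sketch below illustrates that general idea and is not the authors' implementation. The function names (`aggregate_patches`, `slide_feature`), the dot-product softmax assignment, the weighted-mean update, and the number of steps are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_patches(patches, prototypes, steps=2):
    """Hypothetical prototype-based aggregation (assumed, not from the paper).

    Each step softly assigns every patch to the prototypes by similarity,
    then replaces each prototype with the weighted mean of the patches.
    """
    dim = len(prototypes[0])
    for _ in range(steps):
        # weights[i][k]: how strongly patch i belongs to prototype k
        weights = [softmax([dot(p, c) for c in prototypes]) for p in patches]
        updated = []
        for k in range(len(prototypes)):
            mass = sum(w[k] for w in weights)
            updated.append([
                sum(w[k] * p[d] for w, p in zip(weights, patches)) / mass
                for d in range(dim)
            ])
        prototypes = updated
    return prototypes


def slide_feature(prototypes):
    """Average the prototypes into one slide-level vector (assumed pooling)."""
    dim = len(prototypes[0])
    return [sum(c[d] for c in prototypes) / len(prototypes) for d in range(dim)]


# Toy input: four 2-D "patch features" forming two loose groups.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
initial = [[1.0, 0.0], [0.0, 1.0]]
protos = aggregate_patches(patches, initial)
feature = slide_feature(protos)
```

In this toy setup each prototype ends up as a convex combination of the patches and leans toward the group it started closest to. In a real pipeline the prototypes would be learned parameters and the patch features would come from a pre-trained image encoder, neither of which is shown here.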

arXiv information

Authors	Jiangbo Shi, Chen Li, Tieliang Gong, Yefeng Zheng, Huazhu Fu
Published	2025-02-12 13:28:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
