Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

Summary

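Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address few-shot, weakly supervised classification of whole slide images (WSIs).
A key trend is to leverage multi-scale information to better represent hierarchical tissue structures.
However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modality across scales (e.g., 5x and 20x), and (2) inadequate alignment between the visual and textual modalities at the same scale.
To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes, which capture hierarchical relationships, and (2) heterogeneous intra-scale edges that link visual and textual nodes at the same scale.
To further strengthen semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss that aligns textual semantics across scales.
Extensive experiments on the TCGA breast, lung, and kidney cancer datasets show that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, with gains of up to 4.1% in macro F1 in the 16-shot setting.
These results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient, scalable learning from limited pathology data.
The code is available at https://github.com/bryanwong17/HiVE-MIL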

Summary (original)

Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL
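The two structural ideas in the abstract, parent-child links across scales and removal of weakly correlated patch-text pairs, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the HiVE-MIL code: the function names, the grid-coordinate convention, the assumption that patches have the same pixel size at 5x and 20x (so one coarse patch covers a 4 x 4 block of fine patches), and the single fixed cosine-similarity threshold, which stands in for the paper's two-stage dynamic filtering.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[int, int]  # (row, col) index of a patch within its magnification grid


def build_parent_child_links(
    coarse_grid: Tuple[int, int],
    scale_factor: int = 4,  # 20x / 5x; assumes equal patch size in pixels at both scales
) -> Dict[Coord, List[Coord]]:
    """Map each coarse (5x) patch to the fine (20x) patches it spatially covers."""
    rows, cols = coarse_grid
    links: Dict[Coord, List[Coord]] = {}
    for r in range(rows):
        for c in range(cols):
            links[(r, c)] = [
                (r * scale_factor + dr, c * scale_factor + dc)
                for dr in range(scale_factor)
                for dc in range(scale_factor)
            ]
    return links


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(
    patch_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical fixed cutoff, not the paper's dynamic rule
) -> List[Tuple[int, int]]:
    """Keep only (patch index, text index) pairs whose similarity meets the threshold."""
    kept = []
    for i, p in enumerate(patch_embs):
        for j, t in enumerate(text_embs):
            if cosine(p, t) >= threshold:
                kept.append((i, j))
    return kept
```

For example, `build_parent_child_links((2, 3))` yields six coarse nodes with sixteen fine children each, and `filter_pairs` would then decide which visual nodes keep an edge to which text nodes at a given scale. The actual method presumably derives its cutoffs from the text guidance rather than a constant.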

arXiv information

Authors	Bryan Wong, Jong Woo Kim, Huazhu Fu, Mun Yong Yi
Published	2025-05-23 14:48:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
