From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation

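Abstract

Medical image segmentation remains challenging because pixel-level annotations for training are expensive.
In weakly supervised settings, clinician gaze data captures regions of diagnostic interest.
However, its sparsity limits its use for segmentation.
In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the required explanation precision.
Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates gaze and language supervision and leverages their complementary strengths.
Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions matter.
To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model.
The teacher then directs the student through three strategies: (1) multi-scale feature alignment to fuse visual cues with textual semantics;
(2) confidence-weighted consistency constraints to focus on reliable predictions;
(3) adaptive masking to limit error propagation in uncertain regions.
Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that the method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, a 3-5% improvement over gaze baselines without increasing the annotation burden.
By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability.
This work shows how integrating human visual attention with AI-generated semantic context can overcome the limitations of individual weak supervision signals, advancing the development of deployable, annotation-efficient medical AI systems.
Code is available at https://github.com/jingkunchen/FGI.git.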
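
Strategies (2) and (3) can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the function name `masked_weighted_consistency`, the confidence measure max(p, 1 - p), the squared-error penalty, and the fixed `threshold` that stands in for the adaptive masking rule. It is not the authors' implementation, which is in the linked repository.

```python
def masked_weighted_consistency(teacher_probs, student_probs, threshold=0.8):
    """Illustrative teacher-student consistency loss over per-pixel foreground probabilities.

    Hypothetical formulation (not taken from the paper):
    - confidence of a teacher prediction p is max(p, 1 - p), in [0.5, 1];
    - pixels whose confidence is below `threshold` are masked out;
    - remaining pixels contribute a squared error weighted by confidence.
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must cover the same pixels")

    weighted_error = 0.0
    weight_sum = 0.0
    for p_t, p_s in zip(teacher_probs, student_probs):
        confidence = max(p_t, 1.0 - p_t)
        if confidence < threshold:
            # Masking: uncertain teacher pixels do not supervise the student.
            continue
        weighted_error += confidence * (p_s - p_t) ** 2
        weight_sum += confidence

    # If every pixel is masked there is nothing to enforce.
    return weighted_error / weight_sum if weight_sum > 0 else 0.0


# The middle pixel (teacher probability 0.55) is too uncertain and is ignored,
# even though the student disagrees with it strongly.
loss = masked_weighted_consistency([0.95, 0.55, 0.05], [0.85, 0.10, 0.05])
```

Normalizing by the sum of the weights keeps the loss scale comparable when the number of unmasked pixels changes from image to image.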

Abstract (original)

Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.git.
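
The reported results use the Dice score, which measures the overlap between a predicted mask and a reference mask. A minimal sketch of the standard definition, 2|P ∩ T| / (|P| + |T|), is given below for binary masks flattened to 0/1 sequences. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are choices made here, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat sequences of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labeled foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)

    # Two empty masks agree perfectly; this convention avoids division by zero.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One of the two predicted foreground pixels overlaps the single reference pixel:
# intersection = 1, |P| + |T| = 3, so Dice = 2/3.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A value such as 80.78% in the abstract corresponds to this quantity expressed as a percentage; how it is averaged over a dataset is not specified there.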

arXiv information

Authors	Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Tao Tan, Vicente Grau, Jungong Han
Published	2025-04-15 16:32:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
