Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

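Abstract (translated)

Localizing the exact pathological regions in a given medical scan is an important imaging problem that has traditionally required a large number of bounding-box ground-truth annotations to solve accurately.
However, alternative and potentially weaker forms of supervision exist, such as the accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly called phrase grounding.
In this work, we use a publicly available foundation model, namely the Latent Diffusion Model, to perform this challenging task.
This choice is motivated by the fact that the Latent Diffusion Model, although generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, yielding intermediate representations suited to the task at hand.
We also aim to perform this task zero-shot, that is, without any training on the target task, so the model's weights stay frozen.
To this end, we devise strategies to select features and to refine them through post-processing, without any additional learnable parameters.
We compare the proposed method with state-of-the-art approaches that explicitly enforce image-text alignment in a joint embedding space via contrastive learning.
Results on a popular chest X-ray benchmark show that our method is competitive with the state of the art across different types of pathology and, on average, outperforms it on two metrics (mean IoU and AUC-ROC).
The source code will be released upon acceptance at https://github.com/vios-s.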

Abstract (original)

Localizing the exact pathological regions in a given medical scan is an important imaging problem that traditionally requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any training on the target task, meaning that the model’s weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance at https://github.com/vios-s.
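
To make the two ideas in the abstract more concrete, here is a minimal NumPy sketch of (1) how a cross-attention map between image-patch features and text-token features can be turned into a heatmap for selected tokens, and (2) how such a heatmap can be scored against a bounding box with IoU. This is an illustration only, not the authors' method: the function names (token_heatmap, box_mask, iou), the choice of averaging over selected tokens, the min-max normalization, and the 0.5 threshold in the usage note are all assumptions, and the inputs are assumed to be already-projected query and key features rather than activations taken from an actual Latent Diffusion Model.

```python
import numpy as np


def token_heatmap(image_feats, text_feats, token_ids, grid_hw):
    """Cross-attention heatmap for selected text tokens (illustrative only).

    image_feats: (H*W, d) array, one query vector per image patch (assumed
                 already projected into the attention space).
    text_feats:  (T, d) array, one key vector per text token (same assumption).
    token_ids:   indices of the tokens whose attention is averaged.
    grid_hw:     (H, W) spatial grid of the patches.
    """
    d = image_feats.shape[1]
    # Scaled dot-product scores between every patch and every token.
    logits = image_feats @ text_feats.T / np.sqrt(d)
    # Numerically stable softmax over the text-token axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Average the attention given to the selected tokens, then restore the grid.
    heat = attn[:, token_ids].mean(axis=1).reshape(grid_hw)
    # Min-max normalize to [0, 1] (an assumed post-processing step).
    span = heat.max() - heat.min()
    return (heat - heat.min()) / span if span > 0 else np.zeros_like(heat)


def box_mask(shape, box):
    """Boolean mask for a box given as (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask


def iou(pred, target):
    """Intersection over union of two boolean masks (0.0 if both are empty)."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter) / float(union) if union > 0 else 0.0
```

In use, one would binarize the heatmap (for example `token_heatmap(...) >= 0.5`, with the threshold being an arbitrary choice here) and pass it to `iou` together with `box_mask` of the annotated box; averaging that score over a dataset gives a mean IoU in the sense the abstract reports.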

arXiv information

Authors	Konstantinos Vilouras, Pedro Sanchez, Alison Q. O’Neil, Sotirios A. Tsaftaris
Published	2025-01-29 16:43:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
