Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models

Summary

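Diagnostic imaging depends on interpreting both images and radiology reports, but growing data volumes put heavy pressure on medical experts, leading to more errors and workflow backlogs.
Medical vision-language models (med-VLMs) have emerged as a powerful framework for efficiently processing multimodal imaging data, particularly in chest X-ray (CXR) evaluation, although their performance hinges on how well image and text representations are aligned.
Existing alignment methods, mostly based on contrastive learning, prioritize separating disease classes over separating fine-grained pathology attributes such as location, size, or severity, which leads to suboptimal representations.
Here, we propose MedTrim (Meta-entity-driven Triplet mining), a novel method that enhances image-text alignment through multimodal triplet learning guided jointly by disease class and by adjectival and directional pathology descriptors.
Unlike common alignment methods that only separate broad disease classes, MedTrim leverages structured meta-entity information to preserve subtle but clinically significant intra-class variations.
To this end, we first introduce an ontology-based entity recognition module that extracts pathology-specific meta-entities from CXR reports, since annotations of pathology attributes are rare in public datasets.
For refined sample selection in triplet mining, we then introduce a novel score function that captures an aggregate measure of inter-sample similarity based on disease classes and adjectival/directional descriptors.
Lastly, we introduce a multimodal triplet alignment objective for explicit within-modal and cross-modal alignment between samples that share detailed pathology characteristics.
Our demonstrations indicate that MedTrim improves performance on downstream retrieval and classification tasks compared with state-of-the-art alignment methods.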
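
The summary does not give the actual form of the score function, so the following is only a minimal sketch of the general idea: an aggregate similarity over disease classes and adjectival/directional descriptors, used to pick a positive and a negative for a given anchor. The `MetaEntities` container, the Jaccard overlap, the weights, and the `mine_triplet` helper are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaEntities:
    """Hypothetical container for meta-entities extracted from one CXR report."""
    diseases: frozenset
    adjectives: frozenset = frozenset()
    directions: frozenset = frozenset()


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets are treated as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(x, y, weights=(0.5, 0.25, 0.25)):
    """Weighted aggregate of class and descriptor overlap (illustrative weights)."""
    w_dis, w_adj, w_dir = weights
    return (w_dis * jaccard(x.diseases, y.diseases)
            + w_adj * jaccard(x.adjectives, y.adjectives)
            + w_dir * jaccard(x.directions, y.directions))


def mine_triplet(anchor_idx, samples):
    """Return (positive_idx, negative_idx): most and least similar to the anchor."""
    anchor = samples[anchor_idx]
    candidates = [i for i in range(len(samples)) if i != anchor_idx]
    positive = max(candidates, key=lambda i: similarity(anchor, samples[i]))
    negative = min(candidates, key=lambda i: similarity(anchor, samples[i]))
    return positive, negative


samples = [
    MetaEntities(frozenset({"effusion"}), frozenset({"small"}), frozenset({"left"})),
    MetaEntities(frozenset({"effusion"}), frozenset({"small"}), frozenset({"left"})),
    MetaEntities(frozenset({"effusion"}), frozenset({"large"}), frozenset({"right"})),
    MetaEntities(frozenset({"pneumothorax"}), frozenset({"large"}), frozenset({"right"})),
]
positive_idx, negative_idx = mine_triplet(0, samples)
```

In this toy data, the sample with the same disease class but different size and side scores lower than the exact match, which is the kind of intra-class distinction a class-only criterion would miss.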

Summary (original)

Diagnostic imaging relies on interpreting both images and radiology reports, but the growing data volumes place significant pressure on medical experts, yielding increased errors and workflow backlogs. Medical vision-language models (med-VLMs) have emerged as a powerful framework to efficiently process multimodal imaging data, particularly in chest X-ray (CXR) evaluations, albeit their performance hinges on how well image and text representations are aligned. Existing alignment methods, predominantly based on contrastive learning, prioritize separation between disease classes over segregation of fine-grained pathology attributes like location, size or severity, leading to suboptimal representations. Here, we propose MedTrim (Meta-entity-driven Triplet mining), a novel method that enhances image-text alignment through multimodal triplet learning synergistically guided by disease class as well as adjectival and directional pathology descriptors. Unlike common alignment methods that separate broad disease classes, MedTrim leverages structured meta-entity information to preserve subtle but clinically significant intra-class variations. For this purpose, we first introduce an ontology-based entity recognition module that extracts pathology-specific meta-entities from CXR reports, as annotations on pathology attributes are rare in public datasets. For refined sample selection in triplet mining, we then introduce a novel score function that captures an aggregate measure of inter-sample similarity based on disease classes and adjectival/directional descriptors. Lastly, we introduce a multimodal triplet alignment objective for explicit within- and cross-modal alignment between samples sharing detailed pathology characteristics. Our demonstrations indicate that MedTrim improves performance in downstream retrieval and classification tasks compared to state-of-the-art alignment methods.
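
The abstract does not spell out the multimodal triplet alignment objective, so the sketch below only illustrates one plausible shape for it: a standard triplet margin term applied to every within-modal and cross-modal pairing of an anchor with a positive and a negative sample. The cosine distance, the margin value, the equal averaging of the four terms, and all function names are assumptions for illustration, not the authors' formulation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_term(anchor, positive, negative, margin=0.2):
    """Standard triplet margin term: pull the positive closer than the negative."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


def multimodal_triplet_loss(anchor, positive, negative, margin=0.2):
    """Average the triplet term over image/text anchor and target modalities.

    Each argument is a dict with "image" and "text" embedding vectors
    (a hypothetical layout chosen for this sketch).
    """
    modalities = ("image", "text")
    total = 0.0
    for anchor_mod in modalities:      # modality of the anchor embedding
        for target_mod in modalities:  # modality of the positive and negative
            total += triplet_term(anchor[anchor_mod], positive[target_mod],
                                  negative[target_mod], margin)
    return total / 4


anchor = {"image": [1.0, 0.0], "text": [1.0, 0.0]}
close = {"image": [1.0, 0.0], "text": [1.0, 0.0]}
far = {"image": [0.0, 1.0], "text": [0.0, 1.0]}

well_aligned = multimodal_triplet_loss(anchor, positive=close, negative=far)
misaligned = multimodal_triplet_loss(anchor, positive=far, negative=close)
```

The loss is zero when every positive is already closer to the anchor than the corresponding negative by at least the margin, and grows when that ordering is violated in any of the four pairings.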

arXiv information

Authors	Saban Ozturk, Melih B. Yilmaz, Muti Kara, M. Talat Yavuz, Aykut Koç, Tolga Çukur
Published	2025-04-22 14:17:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
