ViLReF: A Chinese Vision-Language Retinal Foundation Model

Summary

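Subtle semantic differences in retinal image and text data pose a major challenge for pre-training vision-language models.
In addition, false negative samples, that is, image-text pairs that share the same semantics but are incorrectly treated as negatives, disrupt vision-language pre-training and degrade the model's ability to learn.
This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset of 451,956 retinal images and their corresponding diagnostic text reports.
The vision-language pre-training strategy leverages expert knowledge to facilitate label extraction and proposes a novel constraint, the Weighted Similarity Coupling Loss, which dynamically adjusts the speed at which sample pairs are pushed further apart in the feature space.
Furthermore, a batch expansion module with dynamic memory queues, maintained by momentum encoders, supplies extra samples and compensates for the vacancies left by eliminating false negatives.
Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks.
The experimental results demonstrate the strong zero-shot and transfer learning capabilities of ViLReF and verify the effectiveness of the pre-training strategy.
The ViLReF model is available at https://github.com/T6Yang/ViLReF.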
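
The following is a minimal sketch of the general ideas the summary names (a momentum-updated encoder, a fixed-size memory queue, and dropping false negatives from a contrastive loss), not the ViLReF implementation. All names here (`momentum_update`, `MemoryQueue`, `masked_info_nce`) are hypothetical, each sample is assumed to carry a single label, and a plain InfoNCE loss stands in for the Weighted Similarity Coupling Loss, whose exact form the abstract does not give.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.99):
    """EMA update of the momentum encoder: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class MemoryQueue:
    """Fixed-size FIFO of (feature, label) pairs kept from earlier batches."""

    def __init__(self, max_size):
        # deque with maxlen silently discards the oldest entries when full.
        self.items = deque(maxlen=max_size)

    def enqueue(self, features, labels):
        self.items.extend(zip(features, labels))

    def __len__(self):
        return len(self.items)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def masked_info_nce(query, query_label, positive, candidates, temperature=0.07):
    """InfoNCE loss for one query.

    Candidates that share the query's label are treated as false negatives
    and dropped (assumption: one hashable label per sample).
    """
    pos_logit = dot(query, positive) / temperature
    neg_logits = [
        dot(query, feat) / temperature
        for feat, label in candidates
        if label != query_label
    ]
    logits = [pos_logit] + neg_logits
    # Numerically stable log-sum-exp over the positive and remaining negatives.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - pos_logit
```

In this sketch, passing `list(queue.items)` as `candidates` is what lets queued samples refill the negatives that the label mask removes from the current batch.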

Abstract (original)

Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model’s learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: https://github.com/T6Yang/ViLReF.

arXiv information

Authors	Shengzhu Yang, Jiawei Du, Jia Guo, Weihang Zhang, Hanruo Liu, Huiqi Li, Ningli Wang
Published	2024-08-20 14:27:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
