Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays

Summary

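Difference visual question answering (diff-VQA) is a challenging task in which a model must answer complex questions about the differences between a pair of images.
The task matters especially for reading chest X-rays, because radiologists in clinical practice often compare images of the same patient taken at different times to track disease progression and changes in severity.
Previous work, however, has focused on designing network architectures specific to diff-VQA and has missed the opportunity to improve performance with a pretrained vision-language model (VLM).
We introduce PLURAL, a new VLM pretrained on natural images and longitudinal chest X-ray data for the diff-VQA task.
The model is developed in stages: it is first pretrained on natural images and text, then trained on longitudinal chest X-ray data.
The longitudinal data consist of pairs of X-ray images, question-answer sets, and radiologists' reports describing how lung abnormalities and diseases change over time.
Our experiments show that PLURAL outperforms state-of-the-art methods both on diff-VQA for longitudinal X-rays and on conventional VQA for a single X-ray image.
Extensive experiments demonstrate that the proposed VLM architecture and pretraining method are effective in improving model performance.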
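The longitudinal data described above (an image pair, question-answer sets, and a report on changes over time) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class name `LongitudinalSample`, its field names, the prompt strings, and the `STAGES` list are all hypothetical and are not taken from the paper or any released code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LongitudinalSample:
    """One example built from two X-rays of the same patient (hypothetical schema)."""
    earlier_image_path: str  # study taken at the earlier time point
    later_image_path: str    # study taken at the later time point
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)
    report_text: str = ""    # radiologist's description of changes over time


def to_text_pairs(sample: LongitudinalSample) -> List[Tuple[str, str]]:
    """Flatten a sample into (prompt, target) text pairs.

    The prompt wording is an assumption made for illustration only.
    """
    pairs = [(f"question: {q}", a) for q, a in sample.qa_pairs]
    if sample.report_text:
        pairs.append(("describe the changes:", sample.report_text))
    return pairs


# Assumed ordering of the two training stages named in the summary:
# generic image-text data first, then longitudinal chest X-ray data.
STAGES = ["natural_image_text", "longitudinal_chest_xray"]
```

In this sketch both the question-answer sets and the report become text targets conditioned on the same image pair, which is one plausible way to use the two kinds of supervision together; the actual training objective is not specified in the abstract.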

Abstract (original)

Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on differences between a pair of images. This task is particularly important in reading chest X-ray images because radiologists often compare multiple images of the same patient taken at different times to track disease progression and changes in its severity in their clinical practice. However, previous works focused on designing specific network architectures for the diff-VQA task, missing opportunities to enhance the model’s performance using a pretrained vision-language model (VLM). Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task. The model is developed using a step-by-step approach, starting with being pretrained on natural images and texts, followed by being trained using longitudinal chest X-ray data. The longitudinal data consist of pairs of X-ray images, along with question-answer sets and radiologist’s reports that describe the changes in lung abnormalities and diseases over time. Our experimental results show that the PLURAL model outperforms state-of-the-art methods not only in diff-VQA for longitudinal X-rays but also in conventional VQA for a single X-ray image. Through extensive experiments, we demonstrate the effectiveness of the proposed VLM architecture and pretraining method in improving the model’s performance.

arXiv information

Authors	Yeongjae Cho, Taehee Kim, Heejun Shin, Sungzoon Cho, Dongmyung Shin
Published	2024-02-14 06:20:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
