MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

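Summary

Medical Visual Question Answering (MedVQA), which answers image-based medical questions in natural language, is a challenging task and represents a significant advance in healthcare.
It helps medical experts interpret medical images quickly, enabling faster and more accurate diagnoses.
However, existing MedVQA solutions often offer limited model interpretability and transparency, which makes their decision-making processes hard to understand.
To address this issue, we devise a semi-automated annotation process that streamlines data preparation and build two new benchmark MedVQA datasets, R-RAD and R-SLAKE.
The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales, generated by multimodal large language models and human annotation, for the question-answer pairs in the existing MedVQA datasets VQA-RAD and SLAKE.
We also design a new framework that fine-tunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process.
The framework includes three distinct strategies for generating decision outcomes and their corresponding rationales, thereby making the medical decision-making process explicit during reasoning.
Extensive experiments show that our method achieves an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines.
The dataset and code will be released publicly.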

Abstract (original)

Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and a significant advancement in healthcare. It helps medical experts swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework that fine-tunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

arXiv information

Authors	Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, Zuozhu Liu
Published	2024-04-18 17:53:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
