Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

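Summary

Multimodal large language models (MLLMs) have advanced markedly in recent years, demonstrating that an intelligent biomedical assistant is feasible.
However, current biomedical MLLMs focus mainly on image-level understanding and restrict interaction to text commands, which limits both what they can do and how flexibly they can be used.
This paper introduces MedPLIB, a new end-to-end multimodal large language model for the biomedical domain with pixel-level understanding.
It supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding.
We propose a new Mixture-of-Experts (MoE) multi-stage training strategy that trains a visual-language expert model and a pixel-grounding expert model in separate phases, followed by fine-tuning with MoE.
This strategy coordinates multitask learning effectively while keeping the inference-time computational cost equivalent to that of a single expert model.
To advance research on biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which covers 8 modalities for complex medical image question answering and image region understanding.
Experimental results show that MedPLIB achieves state-of-the-art results across multiple medical visual-language tasks.
More importantly, in zero-shot evaluation on the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6, respectively, on the mDice metric.
Code, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.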

Abstract (original)

In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.
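
The abstract reports pixel-grounding results on the mDice metric without defining it. As a rough illustration only, the sketch below assumes mDice means the Dice coefficient 2|A∩B| / (|A| + |B|) averaged over a set of predicted and reference binary masks. The function names `dice` and `mean_dice`, the nested-list mask format, and the handling of two empty masks are assumptions made for this example and are not taken from the MedPLIB code.

```python
from typing import Sequence

# Assumed mask format for this sketch: rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks of equal shape."""
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    if pred_area + target_area == 0:
        # Both masks are empty: scored as a perfect match here.
        # This is a convention chosen for the sketch, not something the paper states.
        return 1.0
    return 2.0 * intersection / (pred_area + target_area)


def mean_dice(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample Dice scores over a non-empty set of mask pairs."""
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    # First pair overlaps on 1 pixel out of 2 + 1 foreground pixels; second pair matches exactly.
    print(mean_dice(preds, targets))
```

The function returns values between 0 and 1. The margins of 19.7 and 15.6 quoted in the abstract would correspond to a 0 to 100 scale, which the abstract does not state explicitly.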

arXiv information

Authors	Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, Yehui Yang
Published	2025-01-10 10:07:55+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
