HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Summary

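The rapid development of multimodal large language models (MLLMs) such as GPT-4V has brought significant progress.
However, these models still struggle with medical multimodal tasks because medical vision-text data is limited in both quantity and quality, a consequence of data privacy concerns and high annotation costs.
Pioneering approaches address these limitations with large-scale, de-identified medical image-text pairs from PubMed, but they still fall short because of inherent noise in the data.
To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an "unblinded" capacity to denoise and reformat the data, producing the PubMedVision dataset of 1.3 million medical VQA samples.
Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, with significant improvements on benchmarks including the MMMU Health & Medicine track;
(2) manual checks by medical experts and empirical results confirm that the data quality of our dataset is superior to that of other data construction methods.
Using PubMedVision, we trained a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.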

Summary (original)

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

arXiv information

Authors	Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
Published	2024-06-27 15:50:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
