BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Summary

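The development of vision-language models (VLMs) is driven by large, diverse multimodal datasets.
However, progress toward generalist biomedical VLMs is limited by a shortage of annotated, publicly accessible datasets spanning biology and medicine.
Existing efforts are restricted to narrow domains and do not capture the full diversity of biomedical knowledge encoded in the scientific literature.
To address this gap, we introduce BIOMEDICA, a scalable, open-source framework that extracts, annotates, and serializes the entire PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
The framework produces a comprehensive archive of over 24 million unique image-text pairs drawn from over 6 million articles.
Metadata and expert-guided annotations are also provided.
We demonstrate the utility and accessibility of the resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, which removes the need to download 27 TB of data locally.
On average, our models achieve state-of-the-art performance across 40 tasks spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology.
They excel in zero-shot classification, with a 6.56% average improvement (as high as 29.8% in dermatology and 17.5% in ophthalmology), and deliver stronger image-text retrieval, all while using 10x less compute.
To foster reproducibility and collaboration, we release our codebase and dataset to the broader research community.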

Abstract (original)

The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks – spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology – excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
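
For readers unfamiliar with the zero-shot classification setting mentioned in the abstract, the general CLIP-style procedure can be sketched as follows. This is a minimal illustration, not the BMCA-CLIP implementation: the embeddings are hand-written toy vectors standing in for encoder outputs, and the function names, label prompts, and temperature value are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Pick the label whose text embedding is closest to the image embedding.

    `temperature` is an assumed value; CLIP-style models learn this scale.
    """
    labels = list(label_embs)
    logits = [cosine(image_emb, label_embs[name]) / temperature for name in labels]
    # Numerically stable softmax over the scaled similarities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a histopathology image": [1.0, 0.0, 0.0],
    "a chest X-ray": [0.0, 1.0, 0.0],
    "a dermoscopy image": [0.0, 0.0, 1.0],
}

best, probs = zero_shot_classify(image_emb, label_embs)
print(best)
```

No task-specific training happens here: the class set is defined only by the text prompts, which is what makes the evaluation "zero-shot".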

arXiv information

Authors	Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
Published	2025-01-13 09:58:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
