Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train

Summary

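Foundation models have achieved remarkable success in applications such as disease diagnosis and text report generation.
So far, no foundation model has been available for endoscopic video analysis.
In this paper, we propose Endo-FM, a foundation model developed specifically from large-scale endoscopic video data.
First, we build a video transformer that captures both local and global long-range dependencies across the spatial and temporal dimensions.
Second, we pre-train the transformer in a self-supervised manner using global and local views, aiming to make it robust to spatiotemporal variations and discriminative across different scenes.
To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets with a privately collected dataset from the Baoshan Branch of Renji Hospital in Shanghai, China.
The combined dataset contains over 33,000 video clips with up to 5 million frames, covering a variety of protocols, target organs, and disease types.
The pre-trained Endo-FM can serve as a backbone and be adapted to a given downstream task through fine-tuning.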
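In experiments on 3 types of downstream tasks (classification, segmentation, and detection), Endo-FM surpasses current state-of-the-art (SOTA) self-supervised pre-training and adapter-based transfer learning methods by a significant margin: it exceeds VCL by 3.1% F1, 4.8% Dice, and 5.5% F1, and ST-Adapter by 5.9% F1, 9.6% Dice, and 9.9% F1, on classification, segmentation, and detection respectively.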
Code, datasets, and models are released at https://github.com/med-air/Endo-FM.
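
To make the idea of "global and local views" more concrete, here is a minimal Python sketch of how such views might be sampled from a single clip. It is an illustration under stated assumptions, not the authors' implementation: the `View` class, the function names, the number of views, and all frame counts and crop sizes are hypothetical and do not come from the paper. The sketch only picks the coordinates of each view; cropping the actual pixels is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Coordinates of one spatiotemporal crop (hypothetical helper type)."""
    start: int   # index of the first frame
    frames: int  # number of consecutive frames
    top: int     # top row of the square crop
    left: int    # left column of the square crop
    size: int    # side length of the square crop, in pixels


def sample_view(total_frames, height, width, frames, size, rng):
    """Pick a random run of consecutive frames and a random square crop."""
    # random.Random.randint includes both endpoints.
    start = rng.randint(0, total_frames - frames)
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return View(start, frames, top, left, size)


def sample_views(total_frames, height, width, rng, n_global=2, n_local=4):
    """Return (global_views, local_views) for one clip.

    Assumption for illustration: global views span more frames and a larger
    crop, local views span fewer frames and a smaller crop.
    """
    global_views = [
        sample_view(total_frames, height, width, frames=8, size=96, rng=rng)
        for _ in range(n_global)
    ]
    local_views = [
        sample_view(total_frames, height, width, frames=4, size=48, rng=rng)
        for _ in range(n_local)
    ]
    return global_views, local_views


rng = random.Random(0)
global_views, local_views = sample_views(16, 128, 128, rng)
```

In self-supervised schemes of this kind, a model is typically trained so that its outputs for different views of the same clip agree. The specific training objective used for Endo-FM is not described in the summary above, so it is not sketched here.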

Summary (original)

Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer, which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views via a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset overall consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downstream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art (SOTA) self-supervised pre-training and adapter-based transfer learning methods by a significant margin, such as VCL (3.1% F1, 4.8% Dice, and 5.5% F1 for classification, segmentation, and detection) and ST-Adapter (5.9% F1, 9.6% Dice, and 9.9% F1 for classification, segmentation, and detection). Code, datasets, and models are released at https://github.com/med-air/Endo-FM.
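
The margins quoted in the abstract are given in F1 and Dice. As a reminder of what these scores measure, the following is a minimal Python sketch using their standard definitions on flat binary (0/1) sequences. The function names and the toy inputs are illustrative assumptions, and the abstract does not say how the authors aggregate scores across classes, frames, or detections.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks given as 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def f1_score(pred, target):
    """F1 = 2 * TP / (2 * TP + FP + FN) for binary labels given as 0/1 sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2.0 * tp / denom


# Toy example: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
dice = dice_score(pred, target)
f1 = f1_score(pred, target)
```

For binary inputs the two formulas are algebraically identical. By convention, Dice is reported for pixel-wise segmentation masks and F1 for classification and detection decisions.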

arXiv information

Authors	Zhao Wang, Chang Liu, Shaoting Zhang, Qi Dou
Published	2024-01-09 11:30:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
