Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

Summary

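Recent advances in surgical computer vision have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design.
These methods rely on manually annotated surgical videos to predict a fixed set of object categories, which limits their generalization to unseen surgical procedures and downstream tasks.
In this work, we propose that surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning, without relying on manual annotations.
We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions.
We then present SurgVLP (Surgical Vision Language Pre-training), a novel method for multi-modal representation learning.
Extensive experiments across diverse surgical procedures and tasks show that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis.
Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis: it reduces reliance on extensive manual annotation for downstream tasks and facilitates adaptation methods such as few-shot learning, supporting scalable and data-efficient solutions for various downstream surgical applications.
The [training code](https://github.com/CAMMA-public/SurgVLP) and [weights](https://github.com/CAMMA-public/PeskaVLP) are publicly available.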

Abstract (original)

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP – Surgical Vision Language Pre-training, for multi-modal representation learning. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP’s potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. The [training code](https://github.com/CAMMA-public/SurgVLP) and [weights](https://github.com/CAMMA-public/PeskaVLP) are public.

arXiv information

Authors	Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L. Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, Nicolas Padoy
Published	2025-03-27 15:37:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
