HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

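Summary

Natural language can play an important role in developing generalist surgical models, because raw text offers a broad source of supervision.
This flexible form of supervision lets a model transfer across datasets and tasks, since natural language can both refer to visual concepts already learned and describe new ones.
This work presents HecVL, a new hierarchical video-language pretraining approach for building a generalist surgical model.
Specifically, the authors build a dataset of hierarchical video-text pairs by pairing surgical lecture videos with text at three hierarchical levels:
at clip level, atomic actions taken from transcribed audio;
at phase level, conceptual text summaries;
and at video level, an overall abstract text describing the surgical procedure.
They then propose a new fine-to-coarse contrastive learning framework that uses a single model to learn a separate embedding space for each of the three video-text hierarchies.
Because the embedding spaces of the different hierarchical levels are disentangled, the learned multi-modal representations encode both short-term and long-term surgical concepts in the same model.
Thanks to the injected textual semantics, the HecVL approach is shown to enable zero-shot surgical phase recognition without any human annotation.
The same HecVL model for surgical phase recognition is also shown to transfer across different surgical procedures and medical centers.
The code is available at https://github.com/CAMMA-public/SurgVLP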
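
The abstract names a contrastive objective applied at three hierarchy levels but does not spell it out. As a rough illustration only, the sketch below uses a generic video-to-text InfoNCE loss and simply sums it over the clip, phase, and video levels. The function names `info_nce` and `hierarchical_loss`, the temperature value, the plain summation, and the toy embeddings are all assumptions made for this example, not the authors' implementation.

```python
import math

def info_nce(video_embs, text_embs, temperature=0.07):
    """Generic video-to-text InfoNCE over a batch of matched pairs.

    Row i of video_embs is assumed to be paired with row i of text_embs,
    and all vectors are assumed to be L2-normalized already.
    """
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        # Similarity of video i to every text in the batch, scaled by temperature.
        logits = [
            sum(v * t for v, t in zip(video_embs[i], text_embs[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n

def hierarchical_loss(levels):
    """Sum the InfoNCE loss over hierarchy levels (hypothetical combination).

    `levels` maps a level name to a (video_embs, text_embs) pair, each level
    using its own embedding space.
    """
    return sum(info_nce(v, t) for v, t in levels.values())

# Toy, hand-made unit vectors standing in for real encoder outputs.
aligned = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])

loss_aligned = hierarchical_loss({"clip": aligned, "phase": aligned, "video": aligned})
loss_shuffled = hierarchical_loss({"clip": shuffled, "phase": shuffled, "video": shuffled})
print(loss_aligned < loss_shuffled)
```

Correctly paired embeddings give a much lower loss than mismatched ones, which is the signal a contrastive objective of this kind trains on.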

Abstract (original)

Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model’s transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers. The code is available at https://github.com/CAMMA-public/SurgVLP
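
Zero-shot recognition with a video-language model is commonly done by comparing a visual embedding against the text embeddings of candidate class descriptions and picking the closest one. The sketch below shows that general idea with cosine similarity. It is an assumption about how such inference typically works, not the released SurgVLP code: the phase names, the three-dimensional toy embeddings, and the helper `zero_shot_phase` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def zero_shot_phase(frame_embedding, phase_text_embeddings):
    """Pick the phase whose text embedding is most similar to the frame embedding."""
    scores = {
        name: cosine_similarity(frame_embedding, emb)
        for name, emb in phase_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical phase prompts, with toy vectors standing in for text-encoder outputs.
phase_text_embeddings = {
    "preparation": [1.0, 0.0, 0.0],
    "dissection": [0.0, 1.0, 0.0],
    "closure": [0.0, 0.0, 1.0],
}
frame_embedding = [0.1, 0.9, 0.2]  # toy stand-in for a visual-encoder output

predicted, scores = zero_shot_phase(frame_embedding, phase_text_embeddings)
print(predicted)
```

No labeled frames are needed at inference time in this setup: changing the set of phases only means changing the text descriptions that get embedded.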

arXiv information

Authors	Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Published	2025-03-13 15:27:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
