Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation

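Summary

Surgical video-language pretraining (VLP) faces unique challenges because of the knowledge domain gap and the scarcity of multi-modal data.
This study aims to bridge that gap by addressing two problems: the loss of textual information in surgical lecture videos and the spatial-temporal challenges of surgical VLP.
To tackle them, the authors propose a hierarchical knowledge augmentation approach and a new framework, Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining (PeskaVLP).
The knowledge augmentation uses large language models (LLMs) to refine and enrich surgical concepts, which provides comprehensive language supervision and reduces the risk of overfitting.
PeskaVLP combines language supervision with visual self-supervision, constructs hard negative samples, and employs a Dynamic Time Warping (DTW) based loss function to learn cross-modal procedural alignment.
Extensive experiments on multiple public surgical scene understanding and cross-modal retrieval datasets show that the proposed method significantly improves zero-shot transfer performance and offers a generalist visual representation for further advances in surgical scene understanding.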

Summary (original)

Surgical video-language pretraining (VLP) faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. This study aims to bridge the gap by addressing issues regarding textual information loss in surgical lecture videos and the spatial-temporal challenges of surgical VLP. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining (PeskaVLP) framework to tackle these issues. The knowledge augmentation uses large language models (LLM) for refining and enriching surgical concepts, thus providing comprehensive language supervision and reducing the risk of overfitting. PeskaVLP combines language supervision with visual self-supervision, constructing hard negative samples and employing a Dynamic Time Warping (DTW) based loss function to effectively comprehend the cross-modal procedural alignment. Extensive experiments on multiple public surgical scene understanding and cross-modal retrieval datasets show that our proposed method significantly improves zero-shot transferring performance and offers a generalist visual representation for further advancements in surgical scene understanding. The code is available at https://github.com/CAMMA-public/SurgVLP
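
The abstract names a DTW-based loss but does not spell out its form. As a rough illustration only, the sketch below computes the classic Dynamic Time Warping alignment cost between a sequence of video-frame embeddings and a sequence of text-step embeddings, using cosine distance. The function names `cosine_distance` and `dtw_cost` and the toy embeddings are assumptions made for this example; this is not the PeskaVLP implementation, which would presumably need a differentiable formulation for training.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(frames, steps):
    """Classic DTW: minimum cumulative distance over monotonic alignments.

    frames: list of frame embedding vectors (hypothetical video features)
    steps:  list of step embedding vectors (hypothetical text features)
    """
    n, m = len(frames), len(steps)
    inf = float("inf")
    # acc[i][j] holds the best cost of aligning frames[:i] with steps[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(frames[i - 1], steps[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy example: four frames that follow two procedural steps in order.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
steps_in_order = [[1.0, 0.0], [0.0, 1.0]]
steps_reversed = steps_in_order[::-1]

cost_in_order = dtw_cost(frames, steps_in_order)
cost_reversed = dtw_cost(frames, steps_reversed)
```

In this toy setup the correctly ordered steps align at a lower cost than the reversed ones, which is the intuition behind using a temporally reversed text sequence as a hard negative for procedural alignment.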

arXiv information

Authors	Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Published	2025-03-13 15:21:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
