HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

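Summary

Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw text.
This flexible form of supervision makes models transferable across datasets and tasks, because natural language can be used to reference learned visual concepts or to describe new ones.
This work presents HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model.
Specifically, the authors construct a hierarchical video-text paired dataset by pairing surgical lecture videos with three hierarchical levels of text: at the clip level, atomic actions taken from transcribed audio; at the phase level, conceptual text summaries; and at the video level, an overall abstract of the surgical procedure.
They then propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model.
By disentangling the embedding spaces of the different hierarchical levels, the learned multi-modal representations encode both short-term and long-term surgical concepts in the same model.
Thanks to the injected textual semantics, the HecVL approach is shown to enable zero-shot surgical phase recognition without any human annotation.
Furthermore, the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.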
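
To make the zero-shot step concrete, here is a minimal sketch of how phase recognition without labels typically works in a shared video-text embedding space: each phase is described in text, and a frame is assigned to the phase whose text embedding is most similar. This is an illustration, not the authors' implementation. The phase names, the toy 3-dimensional embeddings, and the helper names `cosine` and `zero_shot_phase` are all hypothetical; in practice the embeddings would come from the pretrained video and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_phase(frame_embedding, phase_text_embeddings):
    """Pick the phase whose text embedding is closest to the frame embedding.

    No phase labels are used for training here: the only supervision is the
    text describing each phase.
    """
    scores = {
        name: cosine(frame_embedding, text_embedding)
        for name, text_embedding in phase_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical phase prompts, already encoded into toy 3-d text embeddings.
phase_text_embeddings = {
    "preparation": [1.0, 0.0, 0.0],
    "dissection": [0.0, 1.0, 0.0],
    "closure": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of one video frame from the visual encoder.
frame_embedding = [0.1, 0.9, 0.2]

predicted, scores = zero_shot_phase(frame_embedding, phase_text_embeddings)
print(predicted)
```

Because the classifier is defined only by the text prompts, moving to a different procedure or medical center would, under this scheme, mean changing the phase descriptions rather than retraining on new annotations.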

Summary (original)

Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model’s transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
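
The abstract does not spell out the training objective, so the following is only a sketch under stated assumptions: it uses a symmetric InfoNCE loss (the CLIP-style contrastive objective) at each of the three levels and simply sums the per-level losses. The function names `info_nce` and `hierarchical_loss`, the temperature value, and the summation are assumptions for illustration; the paper's actual formulation may differ. Embeddings are assumed to be L2-normalised and already projected into the space of their level.

```python
import math


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    video_embs[i] and text_embs[i] form the positive pair; every other
    combination in the batch acts as a negative.
    """
    # Similarity matrix: rows are videos, columns are texts.
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_embs]
        for v in video_embs
    ]

    def mean_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(matrix)

    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(sim_transposed))


def hierarchical_loss(batches, temperature=0.07):
    """Sum InfoNCE over levels (assumed combination rule).

    `batches` maps a level name ("clip", "phase", "video") to a pair
    (video_embs, text_embs) living in that level's own embedding space.
    """
    return sum(
        info_nce(video_embs, text_embs, temperature)
        for video_embs, text_embs in batches.values()
    )


# Toy check: correctly paired embeddings give a much lower loss than swapped ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(aligned, aligned)
swapped_loss = info_nce(aligned, swapped)

total = hierarchical_loss(
    {"clip": (aligned, aligned), "phase": (aligned, aligned), "video": (aligned, aligned)}
)
```

Keeping one loss per level, each computed in its own embedding space, is one way to read "separate embedding spaces for the three video-text hierarchies using a single model": the encoders are shared while the clip, phase, and video objectives do not compete in a single space.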

arXiv information

Authors	Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Published	2024-05-16 13:14:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
