Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis

Summary

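Purpose: Surgical workflow analysis is essential for improving the efficiency and safety of surgery. However, prior work depends heavily on large annotated datasets, which raises problems of cost, scalability, and dependence on expert annotation. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle a range of surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with the text embeddings of the downstream task, bridging the modality gap. Second, text-driven adaptation uses only text data to train a decoder, removing the need for paired image-text data. Applying this decoder to the aligned image embeddings enables image-related tasks without explicit image-text pairs. Results: We evaluated the approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Surg-FTDA outperformed the baselines and generalized well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis while minimizing reliance on large annotated datasets. The code and dataset will be released at https://github.com/CAMMA-public/Surg-FTDA.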

Summary (original)

Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at https://github.com/CAMMA-public/Surg-FTDA
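
The two components described in the Methods can be illustrated with a toy sketch. This is not the Surg-FTDA implementation: it assumes the modality gap is a constant offset estimated from a few image-text pairs (a simple mean-shift stand-in for the paper's alignment step), and it uses a nearest-prototype classifier fitted on text embeddings alone as a stand-in for the text-trained decoder. All function names, labels, and the 2-D embedding values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def estimate_gap(image_embs, text_embs):
    """Few-shot gap estimate (assumption): mean image minus mean text embedding."""
    mean_img = mean_vector(image_embs)
    mean_txt = mean_vector(text_embs)
    return [a - b for a, b in zip(mean_img, mean_txt)]


def align(image_emb, gap):
    """Shift an image embedding into the text embedding space."""
    return [a - g for a, g in zip(image_emb, gap)]


def fit_text_prototypes(text_embs, labels):
    """Text-only 'training': one prototype per label (mean text embedding)."""
    by_label = {}
    for emb, label in zip(text_embs, labels):
        by_label.setdefault(label, []).append(emb)
    return {label: mean_vector(embs) for label, embs in by_label.items()}


def predict(emb, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(emb, prototypes[label]))


def accuracy(embs, labels, prototypes):
    correct = sum(predict(e, prototypes) == y for e, y in zip(embs, labels))
    return correct / len(labels)


# Hypothetical text embeddings for two labels (text-only training data).
text_embs = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1],
             [0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]
text_labels = ["grasp", "grasp", "grasp", "cut", "cut", "cut"]

# Two paired samples, the only image-text pairs used (few-shot alignment).
few_shot_texts = [[1.0, 0.0], [0.0, 1.0]]
few_shot_images = [[-1.95, 2.95], [-3.05, 4.05]]

# Unpaired test images, offset from the text space by roughly (-3, +3).
test_images = [[-1.9, 3.05], [-2.1, 2.9], [-3.1, 4.0], [-2.9, 4.1]]
test_labels = ["grasp", "grasp", "cut", "cut"]

prototypes = fit_text_prototypes(text_embs, text_labels)
gap = estimate_gap(few_shot_images, few_shot_texts)

unaligned_acc = accuracy(test_images, test_labels, prototypes)
aligned_acc = accuracy([align(e, gap) for e in test_images],
                       test_labels, prototypes)

print("accuracy without alignment:", unaligned_acc)
print("accuracy with alignment:", aligned_acc)
```

In this toy setup the text-fitted classifier mislabels the raw image embeddings of one class, and the same classifier labels them correctly once the estimated offset is removed. A learned decoder and a richer alignment would replace both stand-ins in practice.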

arXiv information

Authors	Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Published	2025-03-03 13:05:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
