Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI

Summary

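Automatic summarization of surgical videos is essential for improving procedural documentation, supporting surgical training, and facilitating post-operative analysis.
This paper presents a novel method at the intersection of artificial intelligence and medicine, aiming to develop machine learning models with direct real-world applications in surgical contexts.
It proposes a multimodal framework that leverages recent advances in computer vision and large language models (LLMs) to generate comprehensive video summaries.
The approach consists of three key stages.
First, surgical videos are divided into clips, and visual features are extracted at the frame level using vision transformers.
This step focuses on detecting tools, tissues, organs, and surgical actions.
Second, the extracted features are converted into frame-level captions by large language models.
These captions are then combined with temporal features, captured with a ViViT-based encoder, to produce clip-level summaries that reflect the broader context of each video segment.
Finally, the clip-level descriptions are aggregated into a full surgical report by a dedicated LLM tailored to the summarization task.
The method is evaluated on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
The results show strong performance: 96% precision in tool detection and a BERT score of 0.74 for temporal context summarization.
This work contributes to the advancement of AI-assisted tools for surgical reporting and offers a step toward more intelligent and reliable clinical documentation.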
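
The three stages described above can be sketched as a simple data flow. This is only an illustrative skeleton, not the authors' implementation: the function names (`split_into_clips`, `summarize_video`), the `ClipSummary` type, the fixed clip length, and the three callables standing in for the vision transformer plus captioning LLM, the ViViT-based clip summarizer, and the report-writing LLM are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder type: a real pipeline would pass image arrays, not strings.
Frame = str


@dataclass
class ClipSummary:
    clip_index: int
    frame_captions: List[str]
    summary: str


def split_into_clips(frames: Sequence[Frame], clip_len: int) -> List[List[Frame]]:
    """Cut a video into consecutive fixed-length clips (last clip may be shorter)."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [list(frames[i:i + clip_len]) for i in range(0, len(frames), clip_len)]


def summarize_video(
    frames: Sequence[Frame],
    clip_len: int,
    caption_frame: Callable[[Frame], str],                    # stage 1-2 stand-in
    summarize_clip: Callable[[List[Frame], List[str]], str],  # stage 2 stand-in
    write_report: Callable[[List[str]], str],                 # stage 3 stand-in
) -> str:
    """Run the three stages in order and return the final report text."""
    clip_summaries: List[ClipSummary] = []
    for index, clip in enumerate(split_into_clips(frames, clip_len)):
        # Frame-level captions from per-frame visual features.
        captions = [caption_frame(frame) for frame in clip]
        # Clip-level summary from the captions plus the clip itself
        # (the clip stands in for the temporal features).
        summary = summarize_clip(clip, captions)
        clip_summaries.append(ClipSummary(index, captions, summary))
    # Aggregate all clip-level descriptions into one report.
    return write_report([c.summary for c in clip_summaries])
```

Passing the models in as callables keeps the sketch independent of any particular model library; each stage can be swapped without touching the control flow.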

Abstract (original)

The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. This paper presents a novel method at the intersection of artificial intelligence and medicine, aiming to develop machine learning models with direct real-world applications in surgical contexts. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. The approach is structured in three key stages. First, surgical videos are divided into clips, and visual features are extracted at the frame level using visual transformers. This step focuses on detecting tools, tissues, organs, and surgical actions. Second, the extracted features are transformed into frame-level captions via large language models. These are then combined with temporal features, captured using a ViViT-based encoder, to produce clip-level summaries that reflect the broader context of each video segment. Finally, the clip-level descriptions are aggregated into a full surgical report using a dedicated LLM tailored for the summarization task. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos. The results show strong performance, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal context summarization. This work contributes to the advancement of AI-assisted tools for surgical reporting, offering a step toward more intelligent and reliable clinical documentation.
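
For readers unfamiliar with the tool-detection metric quoted above, precision is the fraction of predicted tools that are actually present. The sketch below assumes a multi-label setting with one set of tool labels per frame and micro-averaging over frames; the abstract does not state how the authors aggregate, and the function name and example labels are illustrative only.

```python
from typing import Sequence, Set


def micro_precision(predicted: Sequence[Set[str]], actual: Sequence[Set[str]]) -> float:
    """Precision = TP / (TP + FP), pooled over all frames (assumed micro-average)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must cover the same frames")
    true_pos = 0
    false_pos = 0
    for pred, truth in zip(predicted, actual):
        true_pos += len(pred & truth)   # predicted tools that are really present
        false_pos += len(pred - truth)  # predicted tools that are absent
    total = true_pos + false_pos
    # With no predictions at all, precision is undefined; return 0.0 by convention here.
    return true_pos / total if total else 0.0


# Two frames: three tools predicted in total, two of them correct.
example = micro_precision(
    [{"grasper", "hook"}, {"clipper"}],
    [{"grasper"}, {"clipper"}],
)
```

In this toy example `example` evaluates to 2/3, since one of the three predicted labels (`hook` in the first frame) is a false positive.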

arXiv information

Authors	Hugo Georgenthum, Cristian Cosentino, Fabrizio Marozzo, Pietro Liò
Published	2025-04-28 15:46:02+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
