SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning

Summary

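Effective evaluation is critical for driving progress in MLLM research.
The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities.
Unlike mathematical reasoning, surgical decision-making operates in a life-critical domain and requires meticulous, verifiable processes to ensure reliability and patient safety.
The task requires the ability to distinguish atomic visual actions and to coordinate complex, long-horizon procedures, capabilities that current benchmarks evaluate inadequately.
To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable multimodal large language models (MLLMs) to perform interpretable surgical action planning.
SAP-Bench is derived from cholecystectomy procedures with a mean duration of 1137.5 s and introduces temporally grounded surgical action annotations, comprising 1,226 clinically validated action clips (mean duration: 68.7 s) that capture five fundamental surgical actions across 74 procedures.
The dataset provides 1,152 strategically sampled current frames, each paired with its corresponding next action, as multimodal analysis anchors.
We propose the MLLM-SAP framework, which leverages MLLMs to generate next-action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge.
To assess the dataset's effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next-action prediction performance.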
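
The evaluation described above, where a model sees a current frame plus an instruction enriched with domain knowledge and must name the next action, can be sketched roughly as follows. This is only an illustrative outline, not the authors' MLLM-SAP implementation: the `Sample` layout, the `build_prompt` and `next_action_accuracy` helpers, the exact-match scoring rule, and the placeholder action labels are all assumptions, and `predict` stands in for whatever MLLM call is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    """One benchmark item: a current frame and the action that follows it."""
    frame_path: str   # hypothetical path to the sampled current frame
    next_action: str  # ground-truth next-action label


def build_prompt(instruction: str, domain_knowledge: str,
                 actions: Sequence[str]) -> str:
    """Combine injected domain knowledge, the instruction, and candidate actions."""
    options = ", ".join(actions)
    return (f"{domain_knowledge}\n{instruction}\n"
            f"Choose one next action from: {options}")


def next_action_accuracy(samples: Sequence[Sample],
                         predict: Callable[[str, str], str],
                         prompt: str) -> float:
    """Fraction of samples whose predicted label matches the ground truth.

    `predict(frame_path, prompt)` is a stand-in for an MLLM call (assumption).
    Matching is case-insensitive exact match after stripping whitespace.
    """
    if not samples:
        return 0.0
    correct = sum(
        predict(s.frame_path, prompt).strip().lower() == s.next_action.lower()
        for s in samples
    )
    return correct / len(samples)
```

Exact-match accuracy is the simplest possible scoring choice here; the benchmark itself may well use different or additional metrics.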

Abstract (original)

Effective evaluation is critical for driving advancements in MLLM research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable multimodal large language models (MLLMs) to perform interpretable surgical action planning. Our SAP-Bench benchmark is derived from cholecystectomy procedures with a mean duration of 1137.5s and introduces temporally-grounded surgical action annotations, comprising 1,226 clinically validated action clips (mean duration: 68.7s) capturing five fundamental surgical actions across 74 procedures. The dataset provides 1,152 strategically sampled current frames, each paired with the corresponding next action as multimodal analysis anchors. We propose the MLLM-SAP framework that leverages MLLMs to generate next action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge. To assess our dataset’s effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (e.g., OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next action prediction performance.

arXiv information

Authors	Mengya Xu, Zhongzhen Huang, Dillan Imans, Yiru Ye, Xiaofan Zhang, Qi Dou
Published	2025-06-13 15:23:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google