EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy

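Abstract

In endoscopic procedures, autonomously tracking abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists.
However, conventional model-based pipelines are fragile: each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, which leads to poor generalization across diverse scenes.
Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by adapting semantically to surgeon prompts without manual recalibration.
Despite this potential, applying VLA models to robotic endoscopy presents unique challenges because of the complex and dynamic anatomical environment of the gastrointestinal (GI) tract.
To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions.
Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting.
To tackle data scarcity and domain shift, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards.
Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization across diverse scenes and complex sequential tasks.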

Abstract (original)

In endoscopic procedures, autonomous tracking of abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile, since each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting. To tackle data scarcity and domain shifts, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards. Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization in diverse scenes and complex sequential tasks.
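
The abstract does not say how the task-aware rewards are computed. As a rough illustration only, the sketch below shows one plausible shape for such a reward in a tracking setting: an overlap term between a predicted and a reference bounding box, plus a term for whether a discrete motion command matches the reference. The names `Box`, `iou`, and `tracking_reward`, the string motion labels, and the equal weights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Degenerate boxes (negative width or height) count as zero area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in [0, 1]."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def tracking_reward(
    pred_box: Box,
    ref_box: Box,
    pred_motion: str,
    ref_motion: str,
    w_box: float = 0.5,     # assumed weight, not from the paper
    w_motion: float = 0.5,  # assumed weight, not from the paper
) -> float:
    """Illustrative reward: weighted sum of box overlap and motion agreement."""
    box_term = iou(pred_box, ref_box)
    motion_term = 1.0 if pred_motion == ref_motion else 0.0
    return w_box * box_term + w_motion * motion_term


if __name__ == "__main__":
    reference = Box(0.0, 0.0, 2.0, 2.0)
    shifted = Box(1.0, 0.0, 3.0, 2.0)
    print(tracking_reward(reference, reference, "up", "up"))
    print(tracking_reward(shifted, reference, "up", "down"))
```

A reward of this kind scores localization and motion choice separately, so a model that finds the target but picks the wrong bending direction still receives partial credit. Whether EndoVLA uses such a decomposition is not stated in the abstract.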

arXiv information

Authors	Chi Kit Ng, Long Bai, Guankun Wang, Yupeng Wang, Huxin Gao, Kun Yuan, Chenhan Jin, Tieyong Zeng, Hongliang Ren
Published	2025-05-21 07:35:00+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
