Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation

Abstract

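We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos, built by integrating temporal feature blending into the TransUNet deep learning framework.
Specifically, the approach combines strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstruction of multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads.
We show that this network design outperforms other state-of-the-art systems by a significant margin when tested on segmentation of the bolus and the pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences.
On our VFSS2022 dataset it achieves a Dice coefficient of $0.8796$ and an average surface distance of $1.0379$ pixels.
Accurate tracking of the pharyngeal bolus is a particularly important application in clinical practice, since it is the primary method for diagnosing swallowing impairment.
Our findings suggest that the proposed model enhances the TransUNet architecture by exploiting temporal information, improving segmentation performance by a significant margin.
We publish key source code, network weights, and ground truth annotations so that the reported performance can be reproduced easily.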
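
The two metrics reported in the abstract can be illustrated with a minimal pure-Python sketch. It assumes binary masks stored as nested lists of 0/1 values. It also assumes one common definition of the average surface distance: the symmetric mean of nearest-neighbour Euclidean distances between 4-connected boundary pixels. The abstract does not say which variant the authors use, so the function names and the definition below are illustrative and are not the authors' evaluation code.

```python
from math import hypot
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: rows of 0/1 values (assumed layout)


def dice_coefficient(pred: Mask, target: Mask) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two equally sized binary masks."""
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    if pred_area + target_area == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / (pred_area + target_area)


def boundary_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Foreground pixels with a background or out-of-image 4-neighbour."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    points = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    points.append((y, x))
                    break
    return points


def average_surface_distance(pred: Mask, target: Mask) -> float:
    """Symmetric mean nearest-boundary distance in pixels (assumed variant)."""
    a, b = boundary_pixels(pred), boundary_pixels(target)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed_sum(src, dst):
        # Sum over src of the distance to the closest point in dst.
        return sum(min(hypot(p[0] - q[0], p[1] - q[1]) for q in dst) for p in src)

    return (directed_sum(a, b) + directed_sum(b, a)) / (len(a) + len(b))


if __name__ == "__main__":
    # Toy masks: the prediction has one extra foreground pixel.
    pred = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
    target = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    print(dice_coefficient(pred, target))          # 2*3 / (4+3) = 6/7
    print(average_surface_distance(pred, target))  # 1 / 7
```

The brute-force nearest-neighbour search is quadratic in the number of boundary pixels. That is fine for a toy example, but a distance transform would be the usual choice for full-resolution frames.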

Abstract (original)

We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design can significantly outperform other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a dice coefficient of $0.8796\%$ and an average surface distance of $1.0379$ pixels. Note that tracking the pharyngeal bolus accurately is a particularly important application in clinical practice since it constitutes the primary method for diagnostics of swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture via exploiting temporal information and improving segmentation performance by a significant margin. We publish key source code, network weights, and ground truth annotations for simplified performance reproduction.

arXiv information

Authors	Chengxi Zeng, Xinyu Yang, Majid Mirmehdi, Alberto M Gambaruto, Tilo Burghardt
Published	2022-08-17 14:28:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
