Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Abstract

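This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a system that uses both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos to support clinical assessment.
The system combines video-based glottis detection with an audio keyword spotting method to analyze both modalities, identify patient vocalizations, and refine the video highlights so that vocal fold movement can be inspected under the best conditions.
Beyond extracting key video segments from raw laryngeal videos, MLVAS can generate effective audio and visual features for Vocal Fold Paralysis (VFP) detection.
Pre-trained audio encoders encode the patient's voice to obtain the audio features.
Visual features are generated by measuring the angle deviation of the left and right vocal folds from the estimated glottal midline on the segmented glottis masks.
To obtain better masks, a diffusion-based refinement step follows conventional U-Net segmentation to reduce false positives.
Several ablation studies were conducted to demonstrate the effectiveness of each module and each modality in the proposed MLVAS.
Experimental results on a public segmentation dataset show the effectiveness of the proposed segmentation module.
In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate that MLVAS can provide reliable, objective metrics and visualizations to assist clinical diagnosis.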
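
The visual feature described above (the angle of each vocal fold relative to the glottal midline) can be illustrated with a small sketch. This is not the MLVAS implementation: the abstract does not say how the midline is estimated, so the sketch assumes it can be approximated by the principal axis of the mask pixels, and it fits a straight line to each outer edge of the mask. The function name `fold_angle_deviation` and the synthetic wedge-shaped mask are hypothetical and exist only for illustration.

```python
import numpy as np


def fold_angle_deviation(mask):
    """Return (left_deg, right_deg): angle of each glottal edge to the midline.

    Assumptions (not taken from the paper): the midline is the principal axis
    of the foreground pixels, and each fold edge is roughly a straight line.
    Which side counts as "left" depends on the sign of the SVD vector, so a
    real pipeline would need a fixed orientation convention.
    """
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(float)
    if len(pts) < 3:
        raise ValueError("mask has too few foreground pixels")

    # Principal axis via SVD of the centered pixel coordinates.
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    mid_dir = vt[0]                               # unit vector along the midline
    normal = np.array([-mid_dir[1], mid_dir[0]])  # unit vector across the midline

    # Express each pixel as (t, s): position along, and signed offset from,
    # the estimated midline.
    t = (pts - centroid) @ mid_dir
    s = (pts - centroid) @ normal

    # For each unit step along the midline, keep the outermost pixel per side.
    bins = np.round(t).astype(int)
    steps = np.unique(bins)
    if len(steps) < 2:
        raise ValueError("mask is too short along the midline to fit an edge")
    left_edge = np.array([s[bins == b].min() for b in steps])
    right_edge = np.array([s[bins == b].max() for b in steps])

    def edge_angle(edge):
        # Slope of offset versus position along the midline, as an angle.
        slope = np.polyfit(steps, edge, 1)[0]
        return float(np.degrees(np.arctan(abs(slope))))

    return edge_angle(left_edge), edge_angle(right_edge)


# Synthetic wedge-shaped mask, purely for illustration.
mask = np.zeros((41, 41), dtype=bool)
for row in range(41):
    half_width = row // 4
    mask[row, 20 - half_width:20 + half_width + 1] = True

left_deg, right_deg = fold_angle_deviation(mask)
print(f"left: {left_deg:.1f} deg, right: {right_deg:.1f} deg")
```

On a symmetric mask like this one the two angles should be nearly equal. A difference between the two sides, tracked over the frames of a vocalization, is the kind of signal such a feature might expose, although how MLVAS turns the angles into classifier inputs is not stated in the abstract.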

Abstract (original)

This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key video segment extraction from the raw laryngeal videos, MLVAS is able to generate effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders are utilized to encode the patient voice to get the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks. To get better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modalities in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate MLVAS’s ability of providing reliable and objective metrics as well as visualization for assisted clinical diagnosis.

arXiv information

Authors	Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Juan Liu, Faya Liang, Ming Li
Published	2025-04-22 15:32:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
