Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches




The increasing availability of online videos has transformed the way we access information and knowledge. A growing number of people now prefer instructional videos because they present step-by-step procedures for accomplishing particular tasks. Instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions. To that end, this paper focuses on answering health-related questions asked by the public by providing visual answers from medical videos. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions. To address this issue, we first propose a pipelined approach to create two large-scale datasets: HealthVidQA-CRF and HealthVidQA-Prompt. We then propose monomodal and multimodal approaches that can effectively provide visual answers from medical videos to natural language questions. We conduct a comprehensive analysis of the results, focusing on the impact of the created datasets on model training and the significance of visual features in enhancing the performance of the monomodal and multimodal approaches. Our findings suggest that these datasets have the potential to enhance the performance of medical visual answer localization tasks, and they point to pre-trained language-vision models as a promising direction for further improvement.


Authors: Deepak Gupta, Kush Attal, Dina Demner-Fushman
Published: 2023-09-21 16:21:28+00:00

Category: cs.CL