Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Summary

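Recent advances in multimodal large language models (MLLMs) have greatly improved the integration of information across modalities, but practical applications in educational and scientific domains remain difficult.
This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by using visual information from slides to improve the accuracy of technical terminology.
Because traditional metrics such as WER are insufficient for assessing performance accurately, the paper proposes severity-aware WER (SWER), which takes into account the content type and severity of ASR errors.
We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, which enables MLLMs to improve transcript quality through post-editing.
Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, underscoring the importance of multimodal information integration.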
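
To make the limitation of plain WER concrete, here is a minimal Python sketch. It is not the paper's SWER definition, which this summary does not give; the weighting scheme, the `TECH_TERMS` set, and the weight value 3.0 are assumptions chosen only to show how a severity-weighted error rate can separate an error on a technical term from an error on an ordinary word.

```python
def word_error_rate(reference, hypothesis, weight=None):
    """Word-level edit distance normalized by the reference.

    With weight=None this is standard WER. With a weight function,
    substituting or deleting a reference word costs weight(word);
    insertions always cost 1.0 (an assumption of this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    w = weight if weight is not None else (lambda word: 1.0)

    # d[i][j] = minimal cost of turning ref[:i] into hyp[:j]
    d = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = d[i - 1][0] + w(ref[i - 1])  # delete every reference word
    for j in range(1, len(hyp) + 1):
        d[0][j] = d[0][j - 1] + 1.0  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = w(ref[i - 1])
            substitute = d[i - 1][j - 1] + (0.0 if ref[i - 1] == hyp[j - 1] else cost)
            delete = d[i - 1][j] + cost
            insert = d[i][j - 1] + 1.0
            d[i][j] = min(substitute, delete, insert)

    # Normalize by the total weight of the reference words.
    return d[-1][-1] / sum(w(word) for word in ref)


# Hypothetical severity weights: technical terms count three times as much.
TECH_TERMS = {"transformer", "encoder"}

def term_weight(word):
    return 3.0 if word in TECH_TERMS else 1.0


reference = "we fine tune the transformer encoder"
term_error = "we fine tune the transform encoder"    # technical term is wrong
minor_error = "we fine tuned the transformer encoder"  # ordinary word is wrong

plain_term = word_error_rate(reference, term_error)
plain_minor = word_error_rate(reference, minor_error)
weighted_term = word_error_rate(reference, term_error, term_weight)
weighted_minor = word_error_rate(reference, minor_error, term_weight)
```

Plain WER gives both hypotheses the same score because each contains one substitution in six words. Under the assumed weights, the transcript that garbles the technical term scores worse than the one with a minor inflection error, which is the kind of distinction a severity-aware metric is meant to capture.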

Summary (original)

Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminologies. Realized that traditional metrics like WER fall short in assessing performance accurately, prompting the proposal of severity-aware WER (SWER) that considers the content type and severity of ASR errors. We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.

arXiv information

Authors	Minghan Wang, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Published	2024-11-14 07:01:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
