Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Abstract

In visual speech processing, the ability to model context is one of the most important requirements because lip movements are inherently ambiguous. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), which maximizes context modeling ability by bringing in the power of LLMs. Specifically, VSP-LLM is designed to perform both visual speech recognition and translation, with the given instruction controlling the type of task. The input video is mapped to the input latent space of an LLM by a self-supervised visual speech model. Because input frames contain redundant information, we propose a novel deduplication method that uses visual speech units to reduce the embedded visual features. With the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates lip movements more effectively than a recent model trained on 433 hours of data.
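
The deduplication idea can be illustrated with a small sketch. The abstract does not spell out the merge rule, so the code below assumes one plausible reading: each frame carries a discrete visual speech unit ID, and consecutive frames that share the same ID are collapsed into one feature by averaging. The function name `deduplicate`, the averaging rule, and the toy inputs are all hypothetical and are not taken from the authors' implementation.

```python
from itertools import groupby


def deduplicate(features, units):
    """Collapse runs of consecutive frames that share a unit ID.

    Assumption: each run is replaced by the mean of its frame features.
    `features` is a list of equal-length float vectors, one per frame;
    `units` is a list of discrete unit IDs, one per frame.
    """
    if len(features) != len(units):
        raise ValueError("features and units must have the same length")

    merged_features = []
    merged_units = []
    start = 0
    for unit, run in groupby(units):
        run_length = len(list(run))
        chunk = features[start:start + run_length]
        dim = len(chunk[0])
        # Element-wise mean over the frames in this run.
        mean = [sum(frame[d] for frame in chunk) / run_length for d in range(dim)]
        merged_features.append(mean)
        merged_units.append(unit)
        start += run_length
    return merged_features, merged_units


# Hypothetical toy input: 6 frames with 2-dimensional features.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [5.0, 5.0]]
units = [7, 7, 2, 2, 2, 7]

merged, merged_units = deduplicate(features, units)
print(len(features), "->", len(merged))  # 6 -> 3
```

Only adjacent repeats are merged, so the final frame with unit 7 stays separate from the first run of 7s and temporal order is preserved. Under this reading, the sequence passed to the LLM shrinks whenever neighboring frames map to the same unit, which is where the computational saving would come from.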

arXiv information

Authors	Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro
Published	2024-05-14 02:58:21+00:00
