Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

Abstract

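Visual Speech Recognition (VSR) aims to transcribe speech into text from lip movements alone.
Because it models speech from visual information only, its performance is inherently sensitive to each person's lip appearance and movements, so VSR models degrade when they are applied to unseen speakers.
To remedy this degradation on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR.
Specifically, motivated by recent advances in Natural Language Processing (NLP), we fine-tune prompts on adaptation data of the target speaker instead of modifying the pre-trained model parameters.
Unlike previous prompt tuning methods, which were mainly limited to Transformer-variant architectures, we explore three types of prompts (the addition, padding, and concatenation forms) that can be applied to VSR models, which are generally composed of a CNN and a Transformer.
With the proposed prompt tuning, we show that a small amount of adaptation data (e.g., less than 5 minutes) can largely improve the performance of a pre-trained VSR model on unseen speakers, even if the pre-trained model was already developed with large speaker variations.
Moreover, by analyzing the performance and parameters of the different prompt types, we investigate when prompt tuning is preferred over fine-tuning methods.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.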

Abstract (original)

Visual Speech Recognition (VSR) aims to infer speech into text depending on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements, and this makes the VSR models show degraded performance when they are applied to unseen speakers. In this paper, to remedy the performance degradation of the VSR model on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Different from the previous prompt tuning methods mainly limited to Transformer variant architecture, we explore different types of prompts, the addition, the padding, and the concatenation form prompts that can be applied to the VSR model which is composed of CNN and Transformer in general. With the proposed prompt tuning, we show that the performance of the pre-trained VSR model on unseen speakers can be largely improved by using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model is already developed with large speaker variations. Moreover, by analyzing the performance and parameters of different types of prompts, we investigate when the prompt tuning is preferred over the finetuning methods. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.
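
The abstract names three prompt forms (addition, padding, and concatenation) but gives no implementation details. The following is a minimal pure-Python sketch of one plausible reading, not the authors' implementation: an addition prompt is summed with the input, a padding prompt supplies learnable values where a convolution would otherwise see zero padding, and a concatenation prompt prepends learnable tokens to a sequence. All function names, shapes, the toy linear "frozen model", and the toy adaptation data are assumptions made for illustration. The last part tunes only the addition prompt by gradient descent while the stand-in pre-trained weights stay fixed.

```python
from typing import List, Tuple

Vector = List[float]

def addition_prompt(x: Vector, prompt: Vector) -> Vector:
    """Addition form (assumed): add a learnable vector element-wise to the input."""
    assert len(x) == len(prompt)
    return [xi + pi for xi, pi in zip(x, prompt)]

def padding_prompt(x: Vector, left: Vector, right: Vector) -> Vector:
    """Padding form (assumed): learnable values take the place of zero padding."""
    return left + x + right

def concatenation_prompt(tokens: List[Vector], prompts: List[Vector]) -> List[Vector]:
    """Concatenation form (assumed): prepend learnable tokens to the sequence."""
    return prompts + tokens

def conv1d_valid(x: Vector, kernel: Vector) -> Vector:
    """Frozen 1-D cross-correlation with no implicit padding."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]

def frozen_model(x: Vector, weights: Vector) -> float:
    """Hypothetical stand-in for a pre-trained network: a fixed linear scorer."""
    return sum(wi * xi for wi, xi in zip(weights, x))

def mse(data: List[Tuple[Vector, float]], weights: Vector, prompt: Vector) -> float:
    """Mean squared error of the frozen model on prompted inputs."""
    errs = [frozen_model(addition_prompt(x, prompt), weights) - t for x, t in data]
    return sum(e * e for e in errs) / len(errs)

def tune_addition_prompt(data, weights, steps: int = 200, lr: float = 0.05) -> Vector:
    """Update only the prompt by gradient descent; the weights are never modified."""
    prompt = [0.0] * len(weights)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, target in data:
            err = frozen_model(addition_prompt(x, prompt), weights) - target
            # d(err^2)/d(prompt_i) = 2 * err * weights_i for the linear scorer
            for i, wi in enumerate(weights):
                grad[i] += 2.0 * err * wi / len(data)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

# Shape behaviour of the padding and concatenation forms.
padded = padding_prompt([1.0, 2.0, 3.0], left=[0.1], right=[0.2])
conv_out = conv1d_valid(padded, [1.0, 1.0, 1.0])  # same length as the unpadded input
extended = concatenation_prompt([[1.0, 2.0], [3.0, 4.0]], prompts=[[0.0, 0.0]])

# Toy adaptation data for one hypothetical speaker: every target is shifted by +1.0
# relative to what the frozen weights predict.
weights = [0.5, -1.0, 2.0]
data = [([1.0, 0.0, 0.0], 1.5), ([0.0, 1.0, 0.0], 0.0), ([0.0, 0.0, 1.0], 3.0)]
before = mse(data, weights, [0.0, 0.0, 0.0])
prompt = tune_addition_prompt(data, weights)
after = mse(data, weights, prompt)
print(before, after)
```

In this toy setting the number of tuned values equals the input size, which is far smaller than a full network, and the same frozen weights can be reused for every speaker with a different prompt per speaker. Real VSR models would apply such prompts to video frames, CNN feature maps, and Transformer input sequences, which this sketch does not attempt.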

arXiv information

Authors	Minsu Kim, Hyung-Il Kim, Yong Man Ro
Published	2023-02-16 06:01:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
