Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition

Summary

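Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, Visual Speech Recognition (VSR) has seen recent advances.
As in other speech processing tasks, these end-to-end VSR systems are usually built on encoder-decoder architectures.
While encoders are somewhat general, several decoding approaches have been explored, such as the conventional hybrid model that combines Deep Neural Networks with Hidden Markov Models (DNN-HMM) and the Connectionist Temporal Classification (CTC) paradigm.
However, data is scarce for some languages and tasks, and no clear comparison between the different types of decoders exists for that setting.
We therefore studied how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used to estimate them.
We also analyzed to what extent our visual speech features adapt to scenarios for which they were not explicitly trained, using either a similar dataset or one collected for a different language.
The results show that, in data-scarce scenarios, the conventional paradigm reaches recognition rates that surpass those of the CTC/Attention model, while also requiring less training time and fewer parameters.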

Summary (original)

Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, recent advances have been achieved in Visual Speech Recognition (VSR). Similar to other speech processing tasks, these end-to-end VSR systems are usually based on encoder-decoder architectures. While encoders are somewhat general, multiple decoding approaches have been explored, such as the conventional hybrid model based on Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) or the Connectionist Temporal Classification (CTC) paradigm. However, there are languages and tasks in which data is scarce, and in this situation, there is not a clear comparison between different types of decoders. Therefore, we focused our study on how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used for their estimation. We also analyzed to what extent our visual speech features were able to adapt to scenarios for which they were not explicitly trained, either considering a similar dataset or another collected for a different language. Results showed that the conventional paradigm reached recognition rates that improve the CTC/Attention model in data-scarcity scenarios along with a reduced training time and fewer parameters.

arXiv information

Authors	David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
Published	2024-02-20 13:33:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
