The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

Summary

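This paper describes the visual speech recognition (VSR) system that NPU-ASLP-LiAuto (Team 237) submitted to the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023. The team took part in the fixed and open tracks of the Single-Speaker VSR Task and in the open track of the Multi-Speaker VSR Task.
For data processing, the lip motion extractor from the baseline is used to produce multi-scale video data.
Several augmentation techniques are also applied during training: speed perturbation, random rotation, horizontal flipping, and color transformation.
The VSR model is an end-to-end architecture trained with a joint CTC/attention loss, consisting of a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder.
Experiments show that, after multi-system fusion, the system achieves a character error rate (CER) of 34.76% on the Single-Speaker Task and 41.06% on the Multi-Speaker Task, ranking first in all three tracks the team entered.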
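
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below illustrates that standard definition only; it is not the challenge's official scoring script, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a six-character reference.
print(f"{cer('今天天气很好', '今天天汽很好'):.2%}")  # prints 16.67%
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.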

Summary (original)

This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first place in all three tracks in which we participated.
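
A joint CTC/attention loss is commonly formed as a weighted interpolation of the two objectives, λ·L_CTC + (1 − λ)·L_att. The sketch below shows only that generic combination. The function name `joint_ctc_attention_loss` is hypothetical, and the default weight of 0.3 is an illustrative assumption, since the abstract does not state the value used.

```python
def joint_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate a CTC loss and an attention-decoder loss.

    ctc_weight is the interpolation factor (lambda); 0.3 is an assumed,
    illustrative default, not a value taken from the paper.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


# With equal weighting the result is the mean of the two losses.
print(joint_ctc_attention_loss(2.0, 4.0, ctc_weight=0.5))  # prints 3.0
```

Setting `ctc_weight` to 0 or 1 reduces the objective to pure attention or pure CTC training, respectively.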

arXiv information

Authors	He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie
Published	2024-02-29 18:09:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
