Cross-Modal Mutual Learning for Cued Speech Recognition

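Abstract

Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communication, built on Cued Speech (CS), a system that encodes spoken language through lip movements and hand gestures for hearing-impaired people. Earlier ACSR approaches mostly fuse modalities by directly concatenating their features. In CS, however, the modalities (lip, hand shape, and hand position) are asynchronous, which can interfere with feature concatenation. To address this, we propose a transformer-based cross-modal mutual learning framework that encourages multi-modal interaction. Unlike vanilla self-attention, the model passes the modality-specific information of each modality through a modality-invariant codebook, which collates linguistic representations for the tokens of every modality. This shared linguistic knowledge is then used to re-synchronize the multi-modal sequences. We also build a new large-scale, multi-speaker CS dataset for Mandarin Chinese. To our knowledge, this is the first work on ACSR for Mandarin Chinese. We run extensive experiments on Chinese, French, and British English, and the results show that our method outperforms the state of the art by a large margin.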

Abstract (original)

Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communications, where the Cued Speech (CS) system utilizes lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often utilize direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities (i.e., lip, hand shape and hand position) in CS may cause interference for feature concatenation. To address this challenge, we propose a transformer based cross-modal mutual learning framework to prompt multi-modal interaction. Compared with the vanilla self-attention, our model forces modality-specific information of different modalities to pass through a modality-invariant codebook, collating linguistic representations for tokens of each modality. Then the shared linguistic knowledge is used to re-synchronize multi-modal sequences. Moreover, we establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese. To our knowledge, this is the first work on ACSR for Mandarin Chinese. Extensive experiments are conducted for different languages (i.e., Chinese, French, and British English). Results demonstrate that our model exhibits superior recognition performance to the state-of-the-art by a large margin.
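
The abstract's idea of passing each modality through a shared, modality-invariant codebook can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `route_through_codebook`, the dimensions, the random features, and the use of a single scaled dot-product attention step are all assumptions made for illustration. It only shows that streams of different lengths (lip, hand shape, hand position) can each be re-expressed over the same set of codebook entries.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_through_codebook(tokens, codebook):
    """Hypothetical helper: attend from tokens (T, d) to a shared codebook (K, d).

    Returns the codebook-mixed tokens (T, d) and the attention weights (T, K).
    """
    d = tokens.shape[-1]
    scores = tokens @ codebook.T / np.sqrt(d)   # (T, K) similarity to each entry
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ codebook, weights          # convex mix of codebook entries

rng = np.random.default_rng(0)
d, K = 16, 8                                    # assumed feature size and codebook size
codebook = rng.normal(size=(K, d))              # one codebook shared by all modalities

# Assumed toy streams; lengths differ to mimic asynchronous modalities.
streams = {
    "lip": rng.normal(size=(10, d)),
    "hand_shape": rng.normal(size=(12, d)),
    "hand_position": rng.normal(size=(9, d)),
}

routed = {name: route_through_codebook(x, codebook) for name, x in streams.items()}
for name, (out, w) in routed.items():
    print(name, out.shape, w.shape)
```

Because every stream is described by weights over the same `K` entries, the weight matrices live in a common space regardless of modality, which is the property a re-synchronization step could build on. How the paper actually learns the codebook and uses it for re-synchronization is not specified in the abstract and is not modeled here.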

arXiv information

Authors	Lei Liu, Li Liu
Published	2022-12-02 10:45:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
