Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

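Summary

Recently, unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have achieved remarkable performance on speech tasks.
These models convert continuous speech signals into discrete tokens (speech discretization) and merge text and speech tokens into a shared vocabulary.
They then train a single decoder-only Transformer on a mixture of speech tasks.
Specifically, all of these models apply loss masking to the input speech tokens for the ASR task, which means they do not explicitly model the dependencies between speech tokens.
In this paper, we attempt to model sequences of speech tokens autoregressively, in the same way as text.
However, we find that applying the conventional cross-entropy loss to the input speech tokens does not consistently improve ASR performance over loss masking.
We therefore propose a new approach called Smoothed Label Distillation (SLD), which introduces a KL divergence loss with smoothed labels on the input speech tokens in order to model speech tokens effectively.
Experiments show that SLD alleviates the limitations of the cross-entropy loss and consistently outperforms loss masking for decoder-only Transformer based ASR across different speech discretization methods.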

Abstract (original)

Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on speech tasks. These models convert continuous speech signals into discrete tokens (speech discretization) and merge text and speech tokens into a shared vocabulary. Then they train a single decoder-only Transformer on a mixture of speech tasks. Specifically, all these models utilize Loss Masking on the input speech tokens for the ASR task, which means that these models do not explicitly model the dependency between the speech tokens. In this paper, we attempt to model the sequence of speech tokens in an autoregressive manner like text. However, we find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over Loss Masking. Therefore, we propose a novel approach denoted Smoothed Label Distillation (SLD), which introduces a KL divergence loss with smoothed labels on the input speech tokens to effectively model speech tokens. Experiments demonstrate that our SLD approach alleviates the limitations of the cross-entropy loss and consistently outperforms Loss Masking for decoder-only Transformer based ASR using different speech discretization methods.
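The three training objectives contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact SLD formulation, so the sketch assumes standard uniform label smoothing with a hypothetical parameter `eps`, an unweighted sum of the speech-token and text-token losses, and averaging over the positions that contribute. All function names are illustrative.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized logit vector.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_label(target, vocab_size, eps):
    # Assumed smoothing scheme: target keeps 1 - eps, the rest share eps uniformly.
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == target else off for i in range(vocab_size)]

def cross_entropy(logits, target):
    return -log_softmax(logits)[target]

def kl_to_smoothed(logits, target, eps):
    # KL(q || p): q is the smoothed label, p is the model's predicted distribution.
    logp = log_softmax(logits)
    q = smoothed_label(target, len(logits), eps)
    return sum(qi * (math.log(qi) - lp) for qi, lp in zip(q, logp) if qi > 0.0)

def sequence_loss(step_logits, targets, is_speech, mode, eps=0.1):
    # mode "mask": speech positions contribute no loss (loss masking).
    # mode "ce":   cross-entropy on every position, speech and text alike.
    # mode "sld":  KL loss with smoothed labels on speech, cross-entropy on text
    #              (an assumed reading of the abstract, not the paper's exact recipe).
    total, count = 0.0, 0
    for logits, tgt, speech in zip(step_logits, targets, is_speech):
        if speech and mode == "mask":
            continue
        if speech and mode == "sld":
            total += kl_to_smoothed(logits, tgt, eps)
        else:
            total += cross_entropy(logits, tgt)
        count += 1
    return total / max(count, 1)

# Toy sequence: two input speech tokens followed by one text (transcript) token.
step_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.0, 0.0, 0.0], [0.3, 0.2, 3.0, 0.1]]
targets = [0, 1, 2]
is_speech = [True, True, False]

for mode in ("mask", "ce", "sld"):
    print(mode, sequence_loss(step_logits, targets, is_speech, mode))
```

With `eps` set to 0 the smoothed label collapses to a one-hot target, so the `"sld"` branch reduces to plain cross-entropy; the smoothing is the only thing that distinguishes the two in this sketch.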

arXiv information

Authors	Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang
Published	2023-11-08 08:45:14+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
