LAST: Language Model Aware Speech Tokenization

Summary

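Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform tasks such as spoken language modeling, text-to-speech, and speech-to-text. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods.
Such an approach can create a mismatch between the tokenization process and the way the resulting tokens are used afterward.
In this study, we propose a novel approach to training a speech tokenizer that leverages the objective of a pre-trained text LM.
We advocate integrating this objective into the process of learning discrete speech representations.
Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs.
We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size.
Our results show that the proposed tokenization method outperforms the evaluated baselines on both spoken language modeling and speech-to-text.
More importantly, unlike prior work, the proposed method allows a single pre-trained LM to process both speech and text inputs, setting it apart from conventional tokenization approaches.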

Abstract (original)

Speech tokenization serves as the foundation of speech language model (LM), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization process and its usage afterward. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate for the integration of this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate the proposed tokenization method outperforms the evaluated baselines considering both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows the utilization of a single pre-trained LM for processing both speech and text inputs, setting it apart from conventional tokenization approaches.
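
To make the idea of "transforming features into a new feature space that enables better clustering" more concrete, here is a minimal sketch of the general pattern: project each continuous feature frame, then assign it to the nearest codebook vector to obtain a discrete token. This is not the authors' implementation. The names `project` and `quantize`, the hand-written projection matrix, and the tiny codebook are all hypothetical, and in the paper the transformation is learned with guidance from a pre-trained text LM rather than fixed by hand.

```python
import math

def project(frame, matrix):
    """Apply a linear map (hypothetical stand-in for a learned adapter)."""
    return tuple(sum(w * x for w, x in zip(row, frame)) for row in matrix)

def quantize(frames, codebook, matrix):
    """Map each projected frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        z = project(frame, matrix)
        # Euclidean distance from the projected frame to every centroid.
        distances = [math.dist(z, centroid) for centroid in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Toy data: 2-dimensional "speech features" and a 3-entry codebook (all assumed).
identity = [(1.0, 0.0), (0.0, 1.0)]
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (3.7, 0.2), (0.2, 0.1)]

tokens = quantize(frames, codebook, identity)
print(tokens)  # [0, 1, 2, 0]
```

The size of the codebook plays the role of the speech vocabulary size mentioned in the abstract; a real system would use high-dimensional features and a far larger codebook.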

arXiv information

Authors	Arnon Turetzky, Yossi Adi
Published	2024-09-10 14:45:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
