SpeechAlign: Aligning Speech Generation to Human Preferences

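Summary

Speech language models have made significant progress in generating realistic speech, with neural codec language models standing out.
However, integrating human feedback to align speech output with human preferences is often neglected.
This paper addresses that omission by first analyzing the distribution gap in codec language models, showing how it creates a mismatch between the training and inference phases that hurts performance.
It then explores learning from human feedback as a way to bridge the distribution gap.
The authors introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models with human preferences.
SpeechAlign first constructs a preference codec dataset that contrasts golden codec tokens with synthetic tokens, then applies preference optimization to improve the codec language model.
Repeating this improvement cycle steadily converts a weak model into a strong one.
Both subjective and objective evaluations show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
SpeechAlign also exhibits robust generalization and works for smaller models.
Code and models will be available at https://github.com/0nutation/SpeechGPT.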
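
The two steps named in the summary (building preference pairs of golden versus synthetic codec tokens, then running preference optimization) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say which preference objective is used, so a DPO-style loss is assumed here, and the names `PreferencePair`, `build_preference_pairs`, `dpo_style_loss`, and the `generate` callback are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str          # conditioning input, e.g. the text to be spoken
    chosen: List[int]    # golden codec tokens (extracted from real speech)
    rejected: List[int]  # synthetic codec tokens (sampled from the current model)


def build_preference_pairs(
    prompts: Sequence[str],
    golden_tokens: Sequence[Sequence[int]],
    generate: Callable[[str], Sequence[int]],
) -> List[PreferencePair]:
    """Pair each golden token sequence with a model-generated one.

    `generate` is a hypothetical stand-in for sampling codec tokens
    from the current codec language model.
    """
    pairs = []
    for prompt, golden in zip(prompts, golden_tokens):
        synthetic = list(generate(prompt))
        pairs.append(PreferencePair(prompt, list(golden), synthetic))
    return pairs


def dpo_style_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss for one pair (an assumed objective, see lead-in).

    Inputs are sequence log-probabilities under the trainable policy
    and under a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In an iterative setup like the one the summary describes, the model updated with this loss would serve as the generator for the next round of pair construction.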

Summary (original)

Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affects performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models to strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.

arXiv information

Authors	Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
Published	2024-04-08 15:21:17+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
