Masked Audio Generation using a Single Non-Autoregressive Transformer

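Summary

We introduce MAGNeT, a masked generative sequence modeling method that operates directly on multiple streams of audio tokens.
Unlike prior work, MAGNeT consists of a single-stage, non-autoregressive transformer.
During training, the model predicts spans of masked tokens obtained from a masking scheduler; during inference, it builds the output sequence gradually over several decoding steps.
To further improve the quality of the generated audio, we introduce a novel rescoring method that uses an external pre-trained model to rescore and rank MAGNeT's predictions, which are then used in later decoding steps.
Finally, we explore a hybrid version of MAGNeT that fuses autoregressive and non-autoregressive models: the first few seconds are generated autoregressively, while the rest of the sequence is decoded in parallel.
We demonstrate the efficiency of MAGNeT on text-to-music and text-to-audio generation, and conduct an extensive empirical evaluation that considers both objective metrics and human studies.
The proposed approach is comparable to the evaluated baselines while being significantly faster (7x faster than the autoregressive baseline).
Through ablation studies and analysis, we show the importance of each component of MAGNeT and point out the trade-offs between autoregressive and non-autoregressive modeling in terms of latency, throughput, and generation quality.
Samples are available on the demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.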
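
The training-time idea of masking spans of tokens chosen by a scheduler can be illustrated with a small sketch. This is not the authors' implementation: the function name `sample_span_mask`, the span length of 3, and the cosine-shaped masking rate are all assumptions made for illustration, and a single token stream is used instead of several.

```python
import math
import random


def sample_span_mask(length, span_len=3, rng=None):
    """Return a list of booleans; True marks a position the model must predict.

    Hypothetical helper: span_len and the cosine rate are illustrative choices.
    """
    rng = rng or random.Random()
    # Assumed scheduler: draw a masking rate in (0, 1] from a cosine curve.
    rate = math.cos(math.pi / 2 * rng.random())
    # Number of span starts needed to cover roughly `rate` of the sequence.
    n_spans = max(1, round(rate * length / span_len))
    mask = [False] * length
    # Pick distinct start positions and mask a contiguous span from each one.
    for start in rng.sample(range(length), min(n_spans, length)):
        for i in range(start, min(start + span_len, length)):
            mask[i] = True
    return mask
```

Because spans may overlap or be cut off at the end of the sequence, the fraction actually masked can be lower than the sampled rate; a real scheduler would need to account for that.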

Abstract (original)

We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines, while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
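
The inference procedure described in the abstract (building the output over several decoding steps, with optional rescoring) can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the paper's code: `toy_model` is a random stand-in for the transformer, the cosine re-masking schedule and the `rescorer` hook signature are hypothetical, and re-masking is done per token on a single stream rather than per span over several streams.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is still masked


def toy_model(tokens, rng, vocab_size=16):
    """Stand-in for the transformer: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(length, steps, model=toy_model, rescorer=None, seed=0):
    """Fill a fully masked sequence over `steps` parallel decoding passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = model(tokens, rng)
        scores = {}
        for i in masked:
            tok, conf = preds[i]
            # Optional hook: let an external scorer adjust the confidence.
            if rescorer is not None:
                conf = rescorer(i, tok, conf)
            scores[i] = conf
            tokens[i] = tok
        # Assumed cosine schedule: how many positions stay masked after this step.
        n_mask = min(int(length * math.cos(math.pi / 2 * step / steps)), len(masked))
        # Re-mask the least confident of the positions predicted in this pass.
        for i in sorted(masked, key=lambda i: scores[i])[:n_mask]:
            tokens[i] = MASK
    return tokens
```

Calling `iterative_decode(20, 4)` returns 20 tokens with no masked positions left, since the schedule reaches zero on the final step. A rescorer could, for example, blend the model confidence with an external score, as in `lambda i, tok, conf: 0.5 * conf + 0.5 * external_score`, where `external_score` is a placeholder for whatever the external model provides.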

arXiv information

Authors	Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
Published	2024-01-09 14:29:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
