Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

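Summary

We propose Sortformer, a new neural model for speaker diarization that is trained with objectives that are unconventional compared with existing end-to-end diarization models.
The permutation problem in speaker diarization has long been regarded as a critical challenge.
Most prior end-to-end diarization systems employ a permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error.
In contrast, we introduce Sort Loss, which lets a diarization model resolve the permutation autonomously, with or without PIL.
We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained with PIL alone.
Crucially, we present a streamlined multispeaker ASR architecture that uses Sortformer as a speaker supervision model and embeds the speaker label estimates in the ASR encoder states using a sinusoidal kernel function.
This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens.
Our experiments show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques.
Code and trained models will be made publicly available via the NVIDIA NeMo framework.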
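The contrast between PIL and a sorted objective can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that "sorted" means ordering the target speakers by the first frame in which each one is active, that both losses are built on binary cross-entropy, and that the two are combined with a simple weight `alpha`; the function names and `alpha` are hypothetical and do not come from the abstract or from NeMo.

```python
from itertools import permutations
from math import log


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two frames-by-speakers matrices."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * log(p) + (1 - t) * log(1.0 - p)
            count += 1
    return total / count


def permute_columns(matrix, order):
    """Reorder the speaker columns of a frames-by-speakers matrix."""
    return [[row[k] for k in order] for row in matrix]


def permutation_invariant_loss(pred, target):
    """PIL: score every speaker permutation of the target, keep the lowest error."""
    n_spk = len(target[0])
    return min(
        bce(pred, permute_columns(target, order))
        for order in permutations(range(n_spk))
    )


def arrival_order(target):
    """Assumed sort key: index of the first frame in which each speaker is active."""
    n_frames, n_spk = len(target), len(target[0])

    def first_active(k):
        for t in range(n_frames):
            if target[t][k] > 0.5:
                return t
        return n_frames  # speakers that never talk go last

    return sorted(range(n_spk), key=first_active)


def sort_loss(pred, target):
    """Sorted objective: one fixed target order, so no permutation search."""
    return bce(pred, permute_columns(target, arrival_order(target)))


def hybrid_loss(pred, target, alpha=0.5):
    """Hypothetical weighted combination of the two losses."""
    return alpha * sort_loss(pred, target) + (1.0 - alpha) * permutation_invariant_loss(pred, target)


# Toy example: the speaker in column 1 talks first, the speaker in column 0 second.
target = [[0, 1], [0, 1], [1, 0], [1, 0]]
pred = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]  # outputs in arrival order
print(arrival_order(target), sort_loss(pred, target), permutation_invariant_loss(pred, target))
```

In this sketch PIL searches all speaker orderings, so its cost grows factorially with the number of speakers, while the sorted objective evaluates a single ordering and therefore ties each output slot to a fixed meaning (first speaker to talk, second speaker to talk, and so on).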

Summary (original)

We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.

arXiv information

Authors	Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
Published	2024-09-10 17:20:11+00:00
arXiv page	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
