Leveraging Large Text Corpora for End-to-End Speech Summarization

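Abstract

End-to-end speech summarization (E2E SSum) generates a summary directly from speech.
Compared with the cascade approach, which chains an automatic speech recognition (ASR) model and a text summarization model, the E2E approach is more promising because it mitigates ASR errors, can incorporate nonverbal information, and simplifies the overall system.
However, collecting large amounts of paired data (i.e., speech and summaries) is difficult, so the available training data is usually insufficient to train a robust E2E SSum system.
This paper presents two new methods that leverage large amounts of external text summarization data for E2E SSum training.
The first uses a text-to-speech (TTS) system to generate synthesized speech, which is paired with the text summary for E2E SSum training.
The second is a TTS-free method that feeds phoneme sequences, instead of synthesized speech, directly into the E2E SSum model.
Experiments show that the proposed TTS-based and phoneme-based methods improve several metrics on the How2 dataset.
In particular, our best system outperforms the previous state-of-the-art system by a large margin (a METEOR improvement of more than 6 points).
To the best of our knowledge, this is the first work to use external language resources for E2E SSum.
We also report a detailed analysis of the How2 dataset to confirm the validity of the proposed E2E SSum system.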

Abstract (original)

End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large amount of paired data (i.e., speech and summary) is difficult, the training data is usually insufficient to train a robust E2E SSum system. In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. The first technique is to utilize a text-to-speech (TTS) system to generate synthesized speech, which is used for E2E SSum training with the text summary. The second is a TTS-free method that directly inputs phoneme sequence instead of synthesized speech to the E2E SSum model. Experiments show that our proposed TTS- and phoneme-based methods improve several metrics on the How2 dataset. In particular, our best system outperforms a previous state-of-the-art one by a large margin (i.e., METEOR score improvements of more than 6 points). To the best of our knowledge, this is the first work to use external language resources for E2E SSum. Moreover, we report a detailed analysis of the How2 dataset to confirm the validity of our proposed E2E SSum system.
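The TTS-free idea can be illustrated with a minimal sketch of the data preparation step: text-only (document, summary) pairs are turned into (phoneme sequence, summary) training examples. This is not the authors' implementation. The abstract does not say how phonemes are produced or how the model consumes them, so the toy lexicon `TOY_LEXICON`, the `<unk>` fallback, the `TrainingExample` container, and both function names are hypothetical and exist only for illustration. A real system would presumably use a full grapheme-to-phoneme tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy pronunciation lexicon (assumption, not from the paper).
TOY_LEXICON: Dict[str, List[str]] = {
    "how": ["HH", "AW"],
    "to": ["T", "UW"],
    "bake": ["B", "EY", "K"],
    "bread": ["B", "R", "EH", "D"],
}
UNK = "<unk>"  # placeholder for words missing from the toy lexicon


@dataclass
class TrainingExample:
    input_kind: str    # "phoneme" here; real speech examples would use "speech"
    inputs: List[str]  # phoneme symbols standing in for acoustic input
    summary: str       # target summary text


def text_to_phonemes(text: str) -> List[str]:
    """Map whitespace-separated words to phonemes via the toy lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, [UNK]))
    return phonemes


def build_phoneme_examples(
    text_corpus: List[Tuple[str, str]]
) -> List[TrainingExample]:
    """Turn (document, summary) text pairs into phoneme-input examples."""
    return [
        TrainingExample("phoneme", text_to_phonemes(doc), summ)
        for doc, summ in text_corpus
    ]


if __name__ == "__main__":
    corpus = [("How to bake bread", "A short bread recipe.")]
    for example in build_phoneme_examples(corpus):
        print(example.input_kind, example.inputs, "->", example.summary)
```

Examples built this way would be mixed with the real (speech, summary) pairs during training. The mixing ratio and the way the model handles the two input types are design choices that the abstract leaves open.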

arXiv information

Authors	Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura
Published	2023-03-02 05:19:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
