Learning Temporal Resolution in Spectrogram for Audio Classification

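Summary

The audio spectrogram is a time-frequency representation that is widely used for audio classification.
One of its key attributes is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT).
Previous work has generally assumed that the hop size should be a constant value (for example, 10 ms).
However, a fixed temporal resolution is not always optimal for every type of sound.
The temporal resolution affects not only classification accuracy but also computational cost.
This paper proposes DiffRes, a novel method that enables differentiable temporal resolution modeling for audio classification.
Given a spectrogram computed with a fixed hop size, DiffRes merges non-essential time frames while preserving the important ones.
DiffRes acts as a "drop-in" module between the audio spectrogram and the classifier, and it can be optimized jointly with the classification task.
DiffRes is evaluated on five audio classification tasks, using mel-spectrograms as the acoustic features followed by off-the-shelf classifier backbones.
Compared with previous methods that use a fixed temporal resolution, the DiffRes-based method achieves equivalent or better classification accuracy with at least a 25% reduction in computational cost.
The paper further shows that DiffRes can improve classification accuracy by increasing the temporal resolution of the input acoustic features without adding to the computational cost.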
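
To make the link between hop size and temporal resolution concrete, the following minimal sketch counts the frames a spectrogram would contain for a few hop sizes. It assumes a centered (padded) STFT, where the frame count is one plus the number of whole hops in the signal. The 16 kHz sample rate, the 10-second clip length, and the helper name `frames_for_hop` are illustrative assumptions, not values taken from the paper.

```python
def frames_for_hop(num_samples: int, hop_length: int) -> int:
    """Frame count of a centered (padded) STFT: one frame per whole hop, plus one."""
    if hop_length <= 0:
        raise ValueError("hop_length must be positive")
    return 1 + num_samples // hop_length


SAMPLE_RATE = 16_000   # assumed sample rate in Hz (not from the paper)
CLIP_SECONDS = 10      # assumed clip length in seconds (not from the paper)
num_samples = SAMPLE_RATE * CLIP_SECONDS

for hop_ms in (5, 10, 20):
    # Convert the hop size from milliseconds to samples.
    hop_length = SAMPLE_RATE * hop_ms // 1000
    n_frames = frames_for_hop(num_samples, hop_length)
    print(f"{hop_ms} ms hop -> {n_frames} frames")
```

Halving the hop size roughly doubles the number of frames the classifier has to process, which is why temporal resolution is tied to computational cost as well as to accuracy.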

Summary (original)

The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a ‘drop-in’ module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
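
The idea of merging non-essential frames while preserving important ones can be illustrated with a toy example. The sketch below is not DiffRes: it is neither differentiable nor learned, and the abstract does not describe how the method scores or merges frames. It is a hand-written greedy heuristic that repeatedly averages the two most similar neighbouring frames until a target length is reached. The function names and the L1 distance used as a similarity measure are assumptions made for illustration only.

```python
from typing import List

Frame = List[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """L1 distance between two spectrogram frames (illustrative choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def merge_similar_frames(spec: List[Frame], target_len: int) -> List[Frame]:
    """Greedily average the most similar adjacent frames until target_len remain."""
    if not 1 <= target_len <= len(spec):
        raise ValueError("target_len must be between 1 and the number of frames")
    frames = [list(f) for f in spec]
    weights = [1] * len(frames)  # how many original frames each entry represents
    while len(frames) > target_len:
        # Find the adjacent pair whose contents differ the least.
        i = min(
            range(len(frames) - 1),
            key=lambda k: frame_distance(frames[k], frames[k + 1]),
        )
        w_a, w_b = weights[i], weights[i + 1]
        # Weighted average keeps the merged frame equal to the mean of its originals.
        merged = [
            (w_a * x + w_b * y) / (w_a + w_b)
            for x, y in zip(frames[i], frames[i + 1])
        ]
        frames[i:i + 2] = [merged]
        weights[i:i + 2] = [w_a + w_b]
    return frames


# Six frames with two frequency bins each; only the fourth frame carries an event.
spec = [[0, 0], [0, 0], [0, 0], [5, 1], [0, 0], [0, 0]]
compressed = merge_similar_frames(spec, target_len=3)
print(compressed)  # the silent runs collapse; the event frame [5, 1] survives
```

In this toy input the redundant silent frames are merged away while the distinctive frame is kept intact, which is the behaviour the abstract describes at a high level. A learnable, differentiable module would replace the fixed distance heuristic with scores optimized jointly with the classifier.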

arXiv information

Authors	Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
Published	2024-01-12 18:35:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
