Late multimodal fusion for image and audio music transcription

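Summary

Music transcription, the conversion of music sources into a structured digital format, is a key problem in Music Information Retrieval (MIR).
When addressing this task computationally, the MIR community follows two lines of research, depending on the input: music documents, handled by Optical Music Recognition (OMR), and audio recordings, handled by Automatic Music Transcription (AMT).
Because these two kinds of input data differ in nature, the fields have developed modality-specific frameworks.
However, both have recently been formulated as sequence labeling tasks, which leads to a common output representation and enables research on a combined paradigm.
In this respect, multimodal image and audio music transcription poses the challenge of effectively combining the information conveyed by the image and audio modalities.
This work explores the question at the late-fusion level: it studies four combination approaches that merge, for the first time, the hypotheses of end-to-end OMR and AMT systems in a lattice-based search space.
Results across a series of performance scenarios, in which the corresponding single-modality models yield different error rates, show benefits from these approaches.
In addition, two of the four strategies significantly improve on the corresponding unimodal standard recognition frameworks.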
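
To make the idea of late fusion concrete, here is a minimal sketch of combining the outputs of two already-trained recognizers. It is not one of the four approaches studied in the paper, which merges hypotheses in a lattice-based search space. As a simpler stand-in, the sketch assumes each model returns an N-best list of (token sequence, log-probability) pairs and rescores the union with a weighted sum. The function name `fuse_nbest`, the `weight` and `floor` parameters, and the pitch-like tokens are all hypothetical.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, ...]          # one candidate transcription
NBest = List[Tuple[Hypothesis, float]]  # (hypothesis, log-probability) pairs


def fuse_nbest(omr: NBest, amt: NBest,
               weight: float = 0.5, floor: float = -20.0) -> NBest:
    """Rescore the union of two N-best lists (illustrative late fusion).

    weight: assumed relative trust in the image (OMR) model, in [0, 1].
    floor:  assumed log-probability for a hypothesis a model did not propose.
    """
    omr_scores: Dict[Hypothesis, float] = dict(omr)
    amt_scores: Dict[Hypothesis, float] = dict(amt)

    fused: Dict[Hypothesis, float] = {}
    for hyp in set(omr_scores) | set(amt_scores):
        fused[hyp] = (weight * omr_scores.get(hyp, floor)
                      + (1.0 - weight) * amt_scores.get(hyp, floor))

    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up hypotheses; the tokens are placeholders, not the paper's encoding.
    omr_hyps = [(("C4", "D4", "E4"), -1.0), (("C4", "D4", "F4"), -1.2)]
    amt_hyps = [(("C4", "D4", "F4"), -0.5), (("C4", "E4", "F4"), -0.9)]
    best, score = fuse_nbest(omr_hyps, amt_hyps)[0]
    print(best, score)
```

With these made-up inputs, the hypothesis that both models propose wins, even though neither list is led by it on the image side. Changing `weight` shifts trust toward the more reliable modality, which is relevant to the scenarios above where the single-modality models have different error rates.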

Abstract (original)

Music transcription, which deals with the conversion of music sources into a structured digital format, is a key problem for Music Information Retrieval (MIR). When addressing this challenge in computational terms, the MIR community follows two lines of research: music documents, which is the case of Optical Music Recognition (OMR), or audio recordings, which is the case of Automatic Music Transcription (AMT). The different nature of the aforementioned input data has conditioned these fields to develop modality-specific frameworks. However, their recent definition in terms of sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by image and audio modalities. In this work, we explore this question at a late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses regarding end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios — in which the corresponding single-modality models yield different error rates — showed interesting benefits of these approaches. In addition, two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.

arXiv information

Authors	María Alfaro-Contreras, Jose J. Valero-Mas, José M. Iñesta, Jorge Calvo-Zaragoza
Published	2022-08-12 17:39:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
