Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

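Summary

This paper presents a system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs).
The approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations.
It uses a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components.
The system achieves competitive performance, with an average WER/CER on the private test set of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.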

Summary (original)

This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance with a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6% using the Qwen2.5-7B as decoder-only language model.
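
The WER/CER figures in the abstract are edit-distance metrics: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length (counted in words for WER, in characters for CER). A minimal sketch of that calculation follows. The function names `edit_distance` and `error_rate` are hypothetical, and this is not the challenge's official scoring script, which may apply text normalization not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion (reference token missing)
                            curr[j - 1] + 1,      # insertion (extra hypothesis token)
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit="word") or CER (unit="char") as a fraction.

    Assumption for illustration: words are whitespace-separated and
    spaces are ignored when counting characters.
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words. CER is typically reported instead of WER for languages written without spaces between words, which is why the abstract quotes a combined WER/CER average.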

arXiv information

Authors	Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran
Published	2025-06-16 15:23:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
