SparQLe: Speech Queries to Text Translation Through LLMs

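Summary

With the growing influence of large language models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multimodal processing and speech understanding.
This study introduces a new approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation.
The proposed approach uses a modality adapter, together with English-language data, to align the extracted speech features with instruction-tuned LLMs.
The experiments show that this method preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for a range of speech understanding applications.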
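
The summary does not say how the modality adapter is built, so the following is only a generic illustration of the idea, not the authors' design. It assumes a very simple adapter: stack every k consecutive speech frames to shorten the sequence, then apply one linear projection into the LLM's embedding dimension. The function names, the dimensions, and the random weights (which would be learned in a real system) are all hypothetical.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping an incomplete tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(out_dim, in_dim, seed=0):
    """Random placeholder weights; a real adapter would learn these."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def adapt(frames, weight, bias, k):
    """Map speech frames to vectors in the (assumed) LLM embedding space."""
    return [linear(x, weight, bias) for x in stack_frames(frames, k)]


# Toy sizes, chosen only for illustration.
speech_dim, llm_dim, k = 4, 6, 2
frames = [[float(t + d) for d in range(speech_dim)] for t in range(7)]

weight, bias = init_linear(llm_dim, speech_dim * k)
speech_embeddings = adapt(frames, weight, bias, k)

# Placeholder for the embedded instruction tokens of the prompt.
prompt_embeddings = [[0.0] * llm_dim for _ in range(3)]

# The LLM would consume the instruction followed by the adapted speech.
llm_input = prompt_embeddings + speech_embeddings
print(len(speech_embeddings), len(llm_input))
```

With these toy sizes, seven input frames are reduced to three adapted vectors, each with the same dimension as the prompt embeddings, so the two sequences can be concatenated before being passed to the language model.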

Abstract (original)

With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that leverages self-supervised speech representations in combination with instruction-tuned LLMs for speech-to-text translation. The proposed approach leverages a modality adapter to align extracted speech features with instruction-tuned LLMs using English-language data. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for various speech understanding applications.

arXiv information

Authors	Amirbek Djanibekov, Hanan Aldarmaki
Published	2025-02-13 12:57:15+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
