Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Abstract

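Previous sign language translation (SLT) methods achieved excellent performance by relying on gloss annotations.
However, labeling high-quality glosses is labor-intensive, which limits the further development of SLT.
Some approaches tackle gloss-free SLT by jointly training the visual encoder and the translation network, but these efforts still suffer from poor performance and inefficient use of powerful large language models (LLMs).
Most seriously, we found that introducing an LLM directly into SLT leads to insufficient learning of visual representations, because the LLM dominates the learning curve.
To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT.
Specifically, we factorize the training process into two stages.
In the visual initializing stage, a lightweight translation model is placed after the visual encoder to pre-train the visual encoder.
In the LLM fine-tuning stage, the knowledge acquired by the visual encoder is frozen and integrated with a pre-trained LLM to unlock the LLM's translation potential.
This factorized training strategy proves highly effective, as evidenced by significant improvements on three SLT datasets, all evaluated under the gloss-free setting.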

Abstract (original)

Previous Sign Language Translation (SLT) methods achieve superior performance by relying on gloss annotations. However, labeling high-quality glosses is a labor-intensive task, which limits the further development of SLT. Although some approaches work towards gloss-free SLT through jointly training the visual encoder and translation network, these efforts still suffer from poor performance and inefficient use of the powerful Large Language Model (LLM). Most seriously, we find that directly introducing LLM into SLT will lead to insufficient learning of visual representations as LLM dominates the learning curve. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initializing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder. In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential. This factorized training strategy proves to be highly effective as evidenced by significant improvements achieved across three SLT datasets which are all conducted under the gloss-free setting.
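The abstract describes the two-stage schedule but not the architecture, so the following is only a toy sketch of the factorization idea in plain Python (no deep-learning framework). It makes several assumptions, all hypothetical: the "visual encoder" is a single scalar weight `w_enc`, the "lightweight translation model" and the "LLM" are each a single scalar head weight, and the toy data, learning rate and step count were chosen arbitrarily. Stage 1 trains the encoder jointly with the lightweight head. Stage 2 keeps the encoder fixed and trains only a new head on top of it. This is not the authors' implementation. In the real method, stage 2 fine-tunes a pre-trained LLM rather than a single weight.

```python
"""Toy sketch of a factorized (two-stage) training schedule.

Hypothetical scalar model: encoder h = w_enc * x, head y_hat = w_head * h.
Stage 1 trains the encoder together with a lightweight head; stage 2
freezes the encoder and trains a separate head standing in for the LLM.
"""


def loss_and_grads(w_enc, w_head, xs, ys):
    """Return the mean squared error and its gradients w.r.t. both weights."""
    n = len(xs)
    loss = g_enc = g_head = 0.0
    for x, y in zip(xs, ys):
        err = w_head * w_enc * x - y
        loss += err * err / n
        g_enc += 2 * err * w_head * x / n
        g_head += 2 * err * w_enc * x / n
    return loss, g_enc, g_head


def train(w_enc, w_head, xs, ys, train_encoder, lr=0.02, steps=200):
    """Plain gradient descent; the encoder weight is updated only if train_encoder."""
    for _ in range(steps):
        _, g_enc, g_head = loss_and_grads(w_enc, w_head, xs, ys)
        if train_encoder:
            w_enc -= lr * g_enc
        w_head -= lr * g_head
    return w_enc, w_head


def factorized_training():
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data with target y = 2x

    # Stage 1 ("visual initializing"): encoder + lightweight head trained jointly.
    w_enc, w_light = train(0.5, 0.5, xs, ys, train_encoder=True)
    loss1, _, _ = loss_and_grads(w_enc, w_light, xs, ys)

    # Stage 2 ("LLM fine-tuning"): encoder frozen, new head trained on top of it.
    # The head's starting value 0.1 is an arbitrary stand-in for pre-trained weights.
    w_enc_after, w_llm = train(w_enc, 0.1, xs, ys, train_encoder=False)
    loss2, _, _ = loss_and_grads(w_enc_after, w_llm, xs, ys)
    return w_enc, w_enc_after, w_light, w_llm, loss1, loss2


w_enc, w_enc_after, w_light, w_llm, loss1, loss2 = factorized_training()
print(f"stage 1 loss={loss1:.2e}, stage 2 loss={loss2:.2e}, "
      f"encoder frozen: {w_enc == w_enc_after}")
```

The script prints both losses and confirms that the encoder weight is the same before and after stage 2. In a deep-learning framework, the same freezing is usually done by disabling gradients for the encoder's parameters or by leaving them out of the optimizer. The abstract does not say which LLM parameters are actually tuned in stage 2.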

arXiv information

Authors	Zhigang Chen, Benjia Zhou, Jun Li, Jun Wan, Zhen Lei, Ning Jiang, Quan Lu, Guoqing Zhao
Published	2024-03-19 09:00:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
