Logos as a Well-Tempered Pre-train for Sign Language Recognition

Abstract

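This paper examines two aspects of the isolated sign language recognition (ISLR) task.
First, although a number of datasets are available, the amount of data for most individual sign languages is limited.
This poses a challenge for cross-language ISLR model training, including transfer learning.
Second, similar signs can have different semantic meanings.
This leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs.
To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset: the most extensive ISLR dataset by number of signers, one of the largest available datasets, and the largest RSL dataset in size and vocabulary.
A model pre-trained on the Logos dataset is shown to be usable as a universal encoder for SLR tasks in other languages, including few-shot learning.
The authors explore cross-language transfer learning approaches and find that joint training with multiple classification heads benefits accuracy on the target low-resource datasets the most.
The key feature of the Logos dataset is that groups of visually similar signs are explicitly annotated.
The authors show that explicitly labeling visually similar signs improves the quality of the trained model as a visual encoder for downstream tasks.
Based on these contributions, they outperform the current state-of-the-art results on the WLASL dataset and obtain competitive results on the AUTSL dataset, using a single-stream model that processes only RGB video.
The source code, dataset, and pre-trained models are publicly available.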

Abstract (original)

This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets while also the largest RSL dataset in size and vocabulary. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.
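
The abstract's idea of joint training with multiple classification heads can be pictured as one shared encoder whose output is routed to a separate head for each dataset, with the loss averaged over a mixed batch. The following is a minimal pure-Python sketch of that routing only, not the authors' implementation: the class `MultiHeadClassifier`, the dataset names `source` and `target`, the class counts, and the 4-dimensional feature vectors (standing in for the output of a video backbone) are all hypothetical.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultiHeadClassifier:
    """One shared encoder plus one classification head per dataset (hypothetical)."""

    def __init__(self, feat_dim, hidden_dim, classes_per_dataset, seed=0):
        rng = random.Random(seed)
        self.encoder = init_matrix(hidden_dim, feat_dim, rng)
        self.heads = {
            name: init_matrix(n_classes, hidden_dim, rng)
            for name, n_classes in classes_per_dataset.items()
        }

    def forward(self, features, dataset):
        # Shared representation (linear layer + ReLU), then the dataset's own head.
        hidden = [max(0.0, h) for h in linear(self.encoder, features)]
        return softmax(linear(self.heads[dataset], hidden))


def joint_loss(model, batch):
    """Mean cross-entropy over a mixed batch of (features, dataset, label)."""
    losses = [-math.log(model.forward(x, ds)[y]) for x, ds, y in batch]
    return sum(losses) / len(losses)


# Assumed toy setup: a 5-class source dataset and a 3-class target dataset.
model = MultiHeadClassifier(
    feat_dim=4, hidden_dim=8, classes_per_dataset={"source": 5, "target": 3}
)
batch = [
    ([0.1, 0.2, 0.3, 0.4], "source", 2),
    ([0.4, 0.3, 0.2, 0.1], "target", 0),
]
loss = joint_loss(model, batch)
```

Because every sample passes through the same encoder, gradients from both datasets would update the shared weights in an actual training loop, while each head only sees labels from its own vocabulary. The optimizer, backbone, and batch-mixing policy are left out here because the abstract does not specify them.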

arXiv information

Authors	Ilya Ovodov, Petr Surovtsev, Karina Kvanchiani, Alexander Kapitanov, Alexander Nagaev
Published	2025-05-15 16:31:49+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
