PALO: A Polyglot Large Multimodal Model for 5B People

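Summary

In pursuit of more inclusive vision-language models (VLMs), this work introduces \textsc{Palo}, a large multilingual multimodal model.
\textsc{Palo} provides visual reasoning in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese), which together cover $\sim$5B people (65\% of the world population).
We adapt the multimodal instruction dataset from English to the target languages with a semi-automated translation pipeline built on a fine-tuned large language model, which keeps linguistic fidelity high while remaining scalable because little manual effort is needed.
Including diverse instruction sets improves overall performance across languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu.
The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and they show substantial improvements over strong baselines.
We also propose the first multilingual multimodal benchmark for evaluating the vision-language reasoning of future approaches across languages.
Code: https://github.com/mbzuai-oryx/PALO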

Abstract (original)

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called \textsc{Palo}. \textsc{Palo} offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of $\sim$5B people (65\% of the world population). Our approach involves a semi-automated translation approach to adapt the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while allowing scalability due to minimal manual effort. The incorporation of diverse instruction sets helps us boost overall performance across multiple languages especially those that are underrepresented like Hindi, Arabic, Bengali, and Urdu. The resulting models are trained across three scales (1.7B, 7B and 13B parameters) to show the generalization and scalability where we observe substantial improvements compared to strong baselines. We also propose the first multilingual multimodal benchmark for the forthcoming approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

arXiv information

Authors	Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
Published	2024-02-22 18:59:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
