H2OVL-Mississippi Vision Language Models Technical Report

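Summary

Small vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications because they can run efficiently on consumer hardware to process enterprise commercial documents and images.
These models need strong language understanding and visual capabilities to enhance human-machine interaction.
To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs.
H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition; it achieves state-of-the-art performance on the Text Recognition portion of OCRBench and surpasses much larger models in this area.
In addition, we release H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, which shows highly competitive metrics across a range of academic benchmarks.
Both models build on our prior work with the H2O-Danube language models, extending their capabilities into the visual domain.
We release them under the Apache 2.0 license, making VLMs accessible to everyone and democratizing document AI and visual LLMs.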

Summary (original)

Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state of the art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics across various academic benchmarks. Both models build upon our prior work with H2O-Danube language models, extending their capabilities into the visual domain. We release them under the Apache 2.0 license, making VLMs accessible to everyone, democratizing document AI and visual LLMs.

arXiv information

Authors	Shaikat Galib, Shanshan Wang, Guanshuo Xu, Pascal Pfeiffer, Ryan Chesler, Mark Landry, Sri Satish Ambati
Published	2024-10-17 14:46:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
