Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting

Summary

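This work addresses the need to balance performance and efficiency when serving visually-rich document understanding (VDU) tasks in scalable production environments.
Current systems rely on large document foundation models, which offer advanced capabilities but carry a heavy computational burden.
The paper proposes a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types, and exit placements.
The goal is a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification.
Through a comprehensive set of experiments, the authors compare their approach with traditional exit policies and show an improved performance-efficiency trade-off.
The multimodal EE design preserves the model's predictive capabilities while improving inference speed.
Specifically, it reduces latency by more than 20% while fully retaining the baseline accuracy.
The work is presented as the first exploration of multimodal EE design in the VDU community, and it also highlights the effectiveness of calibration in improving the confidence scores used for exiting at different layers.
Overall, the findings contribute to practical VDU applications by improving both performance and efficiency.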
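
To make the idea of confidence-based exiting concrete, here is a minimal sketch of one common exit policy: stop at the first exit whose top softmax probability reaches a threshold. This is not the authors' implementation. The function names, the threshold value, and the use of per-exit temperature scaling as the calibration step are all assumptions made for illustration, and the per-exit logits are precomputed here, whereas a real model would compute deeper layers only when needed.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; temperature > 1 softens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    exit_logits: Sequence[Sequence[float]],
    threshold: float = 0.9,
    temperatures: Sequence[float] = (),
) -> Tuple[int, int, float]:
    """Return (exit_index, predicted_class, confidence).

    exit_logits holds one logit vector per exit, ordered shallow to deep.
    Hypothetical policy: stop at the first exit whose top probability
    reaches `threshold`; otherwise fall back to the final exit.
    temperatures optionally gives one calibration temperature per exit
    (assumed to be fitted elsewhere); missing entries default to 1.0.
    """
    if not exit_logits:
        raise ValueError("need at least one exit")
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        t = temperatures[i] if i < len(temperatures) else 1.0
        probs = softmax(logits, t)
        confidence = max(probs)
        predicted = probs.index(confidence)
        if confidence >= threshold or i == last:
            return i, predicted, confidence
    raise AssertionError("unreachable: the last exit always returns")


# Made-up logits for three exits of a 3-class classifier.
exits = [[1.0, 1.2, 0.9], [4.0, 0.5, 0.2], [6.0, 0.1, 0.1]]
print(early_exit_predict(exits, threshold=0.9))
```

Raising the threshold, or softening an overconfident exit with a temperature above 1, pushes more inputs to deeper exits. That is the accuracy-versus-latency dial an exit policy of this kind controls.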

Abstract (original)

This work addresses the need for a balanced approach between performance and efficiency in scalable production environments for visually-rich document understanding (VDU) tasks. Currently, there is a reliance on large document foundation models that offer advanced capabilities but come with a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types and placements. Our goal is to achieve a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and showcase an improved performance-efficiency trade-off. Our multimodal EE design preserves the model’s predictive capabilities, enhancing both speed and latency. This is achieved through a reduction of over 20% in latency, while fully retaining the baseline accuracy. This research represents the first exploration of multimodal EE design within the VDU community, highlighting as well the effectiveness of calibration in improving confidence scores for exiting at different layers. Overall, our findings contribute to practical VDU applications by enhancing both performance and efficiency.

arXiv information

Authors	Omar Hamed, Souhail Bakkali, Marie-Francine Moens, Matthew Blaschko, Jordy Van Landeghem
Published	2024-05-21 11:52:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
