Multimodal Web Navigation with Instruction-Finetuned Foundation Models

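Summary

Progress in autonomous web navigation has been held back by a reliance on billions of exploratory interactions through online reinforcement learning, and by domain-specific model designs that make it hard to exploit generalization from rich out-of-domain data.
This work studies data-driven offline training of web agents using vision-language foundation models.
The authors propose WebGUM, an instruction-following multimodal agent that observes both webpage screenshots and HTML pages and outputs web navigation actions such as click and type.
WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations.
The authors show empirically that this recipe improves the agent's grounded visual perception, HTML comprehension, and multi-step reasoning, and that it outperforms prior work by a significant margin.
On the MiniWoB benchmark, it improves over the previous best offline methods by more than 31.9% and comes close to the online-finetuned SoTA.
On the WebShop benchmark, the 3-billion-parameter model outperforms the existing SoTA, PaLM-540B.
The authors also collect 347K high-quality demonstrations using their trained models, 38 times more than prior work, and release them to support future research in this direction.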

Summary (original)

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent’s ability of grounded visual perception, HTML comprehension and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, being close to reaching online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
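
As a rough illustration of the interface the abstract describes (an agent that receives an instruction, a screenshot, and an HTML page, and emits actions such as click and type), the following is a minimal sketch. Every name in it (`Observation`, `Action`, `parse_action`) and the text format of the actions are hypothetical choices made for this example; none of them are taken from the WebGUM paper or its code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """What the agent sees at one step (field layout is an assumption)."""
    instruction: str   # natural-language task description
    html: str          # HTML source of the current page
    screenshot: bytes  # raw image bytes of the rendered page


@dataclass(frozen=True)
class Action:
    """A navigation action; only the two kinds named in the abstract."""
    kind: str                   # "click" or "type"
    target: str                 # hypothetical element identifier
    text: Optional[str] = None  # text to enter, for "type" only


# Hypothetical text format: `click <id>` or `type <id> "<text>"`.
_ACTION_RE = re.compile(r'^(click|type)\s+(\S+)(?:\s+"(.*)")?$')


def parse_action(raw: str) -> Action:
    """Turn a model's text output into a structured Action."""
    match = _ACTION_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {raw!r}")
    kind, target, text = match.groups()
    if kind == "type" and text is None:
        raise ValueError("a type action needs quoted text")
    if kind == "click" and text is not None:
        raise ValueError("a click action takes no text")
    return Action(kind, target, text)
```

Emitting actions as text and parsing them afterwards fits a setup where a language model produces the output, but the concrete grammar above is only one possible choice.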

arXiv information

Authors	Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur
Published	2023-05-19 17:44:34+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
