Multimodal Web Navigation with Instruction-Finetuned Foundation Models

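Summary

Progress in autonomous web navigation has been hindered by a dependence on billions of exploratory interactions through online reinforcement learning, and by domain-specific model designs that make it hard to exploit generalization from rich out-of-domain data.
This work studies data-driven offline training of web agents built on vision-language foundation models.
We propose WebGUM, an instruction-following multimodal agent that observes both webpage screenshots and HTML pages and outputs web navigation actions such as click and type.
WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations.
We show empirically that this recipe improves the agent's grounded multimodal perception, HTML comprehension, and multi-step reasoning, and that it outperforms prior work by a significant margin.
On MiniWoB, it improves on the previous best offline methods by more than 45.8% and even outperforms the online-finetuned SoTA, humans, and GPT-4-based agents.
On the WebShop benchmark, our 3-billion-parameter model outperforms the existing SoTA, PaLM-540B.
Furthermore, WebGUM exhibits strong positive transfer to real-world planning tasks on Mind2Web.
We also collect 347K high-quality demonstrations using our trained models, 38 times more than prior work, and make them available to promote future research in this direction.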

Abstract (original)

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent’s ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.

arXiv information

Authors	Hiroki Furuta,Kuang-Huei Lee,Ofir Nachum,Yutaka Matsuo,Aleksandra Faust,Shixiang Shane Gu,Izzeddin Gur
Published	2023-10-01 10:15:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google