VIP5: Towards Multimodal Foundation Models for Recommendation

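Summary

Computer vision (CV), natural language processing (NLP), and recommender systems (RecSys) are three prominent AI applications that have traditionally been developed independently, which has left them with different modeling and engineering methodologies.
This separation has kept the fields from benefiting directly from one another's advances.
As multimodal data becomes more widely available on the web, there is a growing need to consider multiple modalities when making recommendations to users.
With the recent emergence of foundation models, large language models have become a potential general-purpose interface for unifying different modalities and problem formulations.
In light of this, we propose developing a multimodal foundation model that considers both visual and textual modalities under the P5 recommendation paradigm (VIP5), in order to unify various modalities and recommendation tasks.
This allows vision, language, and personalization information to be processed in a shared architecture, which improves recommendations.
To achieve this, we introduce multimodal personalized prompts that accommodate multiple modalities in a shared format.
We also propose a parameter-efficient training method for foundation models: the backbone is frozen and only lightweight adapters are fine-tuned, which improves recommendation performance while reducing training time and memory usage.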

Abstract (original)

Computer Vision (CV), Natural Language Processing (NLP), and Recommender Systems (RecSys) are three prominent AI applications that have traditionally developed independently, resulting in disparate modeling and engineering methodologies. This has impeded the ability for these fields to directly benefit from each other’s advancements. With the increasing availability of multimodal data on the web, there is a growing need to consider various modalities when making recommendations for users. With the recent emergence of foundation models, large language models have emerged as a potential general-purpose interface for unifying different modalities and problem formulations. In light of this, we propose the development of a multimodal foundation model by considering both visual and textual modalities under the P5 recommendation paradigm (VIP5) to unify various modalities and recommendation tasks. This will enable the processing of vision, language, and personalization information in a shared architecture for improved recommendations. To achieve this, we introduce multimodal personalized prompts to accommodate multiple modalities under a shared format. Additionally, we propose a parameter-efficient training method for foundation models, which involves freezing the backbone and fine-tuning lightweight adapters, resulting in improved recommendation performance and increased efficiency in terms of training time and memory usage.
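The abstract does not say how the backbone is frozen or what the adapters look like, so the following is only a minimal bookkeeping sketch of the general idea: mark backbone weights as frozen and let the optimizer update adapter weights alone. The `Param` class, the parameter names, the `adapter` name tag, and the placeholder gradients are all hypothetical and are not taken from the VIP5 paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """One named weight with a flag saying whether training may change it."""
    name: str
    value: float
    trainable: bool = True


def freeze_backbone(params, adapter_tag="adapter"):
    """Mark as trainable only the parameters whose name contains `adapter_tag`."""
    for p in params:
        p.trainable = adapter_tag in p.name
    return params


def sgd_step(params, grads, lr=0.1):
    """Apply one SGD update, skipping every frozen parameter."""
    for p in params:
        if p.trainable:
            p.value -= lr * grads[p.name]


def trainable_fraction(params):
    """Share of parameters that the optimizer is allowed to update."""
    return sum(p.trainable for p in params) / len(params)


# Hypothetical parameter names; a real model would hold tensors, not floats.
params = freeze_backbone([
    Param("backbone.layer0.weight", 0.5),
    Param("backbone.layer1.weight", -0.3),
    Param("backbone.layer1.adapter.down", 0.0),
    Param("backbone.layer1.adapter.up", 0.0),
])

# Placeholder gradients: every parameter receives one, but only the
# adapter weights are actually changed by the update.
grads = {p.name: 1.0 for p in params}
sgd_step(params, grads)

for p in params:
    print(p.name, p.value, "trainable" if p.trainable else "frozen")
print("trainable fraction:", trainable_fraction(params))
```

In this toy setup half of the parameters are trainable; in practice adapters are usually a far smaller share of the model, which is where savings in training time and memory would come from.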

arXiv information

Authors	Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, Yongfeng Zhang
Published	2023-05-23 17:43:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
