FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

Summary

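Effective personalization of LLMs is important for a wide range of user-facing applications such as virtual assistants and content curation.
Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem.
Under this framework, an LLM learns to adapt quickly to a user from a few labeled preferences provided by that user, constructing a personalized reward function for them.
In addition, because real-world preference data is scarce and difficult to collect at scale, we propose careful design choices for constructing synthetic preference datasets for personalization, and we generate over 1 million synthetic personalized preferences using publicly available LLMs.
In particular, we find that for successful transfer from synthetic data to real users, it is crucial that the data exhibit both high diversity and a coherent, self-consistent structure.
We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains (movie reviews, pedagogical adaptation based on educational background, and general question answering), along with a controlled human study.
Overall, FSPO achieves an average Alpaca Eval win rate of 87% when generating responses personalized to synthetic users, and a 72% win rate with real human users in open-ended question answering.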
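
The few-shot setup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Preference` schema, the prompt layout, and the helper names are hypothetical, and the Bradley-Terry loss is a standard choice for reward modeling that is assumed here rather than taken from the summary. The sketch shows how a user's few labeled preferences might be serialized into a context, how a pairwise preference loss is computed from two reward scores, and how a win rate is tallied.

```python
import math
from dataclasses import dataclass


@dataclass
class Preference:
    """One labeled comparison from a user (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str


def build_few_shot_context(shots: list[Preference]) -> str:
    """Serialize a user's labeled preferences into a text prefix.

    The layout is an assumption made for illustration only.
    """
    blocks = []
    for i, shot in enumerate(shots, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Question: {shot.prompt}\n"
            f"Preferred: {shot.chosen}\n"
            f"Rejected: {shot.rejected}"
        )
    return "\n\n".join(blocks)


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Computes -log(sigmoid(margin)) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def win_rate(outcomes: list[bool]) -> float:
    """Fraction of pairwise comparisons won."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


# Hypothetical preferences from a single user.
shots = [
    Preference("Recommend a movie.", "A short, spoiler-free pick.", "A long plot summary."),
    Preference("Explain gravity.", "A plain-language analogy.", "A dense derivation."),
]
context = build_few_shot_context(shots)
```

In a meta-learning setup of this kind, each user acts as a separate task: the context built from that user's examples would be prepended to a new query, and the model would be trained so that the reward assigned to the user's preferred response exceeds the reward assigned to the rejected one.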

Summary (original)

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.

arXiv information

Authors	Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
Publication date	2025-02-26 17:08:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
