Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models

Abstract

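Fine-tuning LLMs with first-order methods such as back-propagation is computationally intensive.
Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, which reduces memory usage, but it suffers from slow convergence in high-dimensional models.
As a result, ZO research on LLMs has mostly focused on classification and has overlooked more complex generative tasks.
This paper introduces ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs.
We first analyse the interplay between the policy model and the reward model during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates.
Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence.
Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods.
While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction.
Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO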

Abstract (original)

Fine-tuning LLMs with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation, using function evaluations instead of gradients, reduces memory usage but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO
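
For readers unfamiliar with SPSA, the following is a minimal sketch of the generic two-point estimator the abstract refers to, applied to a toy quadratic. It is not the ZOPrO algorithm and does not include the paper's targeted sampling strategy. The names `spsa_gradient`, `spsa_minimise`, and `toy_loss`, along with the step size, perturbation scale, and step count, are assumptions chosen for illustration only.

```python
import random


def spsa_gradient(loss, theta, eps=1e-3, rng=random):
    """Two-point SPSA estimate of the gradient of `loss` at `theta`."""
    # Rademacher perturbation: every coordinate is +1 or -1.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + eps * d for t, d in zip(theta, delta)]
    minus = [t - eps * d for t, d in zip(theta, delta)]
    # Only two function evaluations are needed, regardless of dimension,
    # and no back-propagation is involved.
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    # For +/-1 entries, dividing by delta_i equals multiplying by delta_i.
    return [scale * d for d in delta]


def spsa_minimise(loss, theta, lr=0.05, steps=500, eps=1e-3, seed=0):
    """Plain gradient descent driven by the SPSA estimate (illustrative)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = spsa_gradient(loss, theta, eps, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def toy_loss(theta):
    # Hypothetical stand-in for a training objective: minimum at all ones.
    return sum((t - 1.0) ** 2 for t in theta)


start = [0.0, 0.0, 0.0, 0.0]
result = spsa_minimise(toy_loss, start)
print(toy_loss(start), toy_loss(result))
```

Each step costs two loss evaluations and stores only the parameters and one perturbation vector, which is the memory advantage the abstract mentions. The estimate is noisy, and that noise grows with dimension, which is consistent with the slow convergence the abstract attributes to ZO methods in high-dimensional models.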

arXiv information

Authors	Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif
Published	2025-03-05 12:49:48+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
