Preference Optimization for Combinatorial Optimization Problems

Abstract

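Reinforcement learning (RL) has emerged as a powerful tool for neural combinatorial optimization, allowing models to learn heuristics that solve complex problems without expert knowledge.
Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration of vast combinatorial action spaces, leading to inefficiency.
This paper proposes Preference Optimization, which converts quantitative reward signals into qualitative preference signals through statistical comparison modeling and emphasizes the superiority among sampled solutions.
Methodologically, by reparameterizing the reward function in terms of the policy and using preference models, it formulates an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations.
In addition, local search techniques are integrated into fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima.
Empirical results on various benchmarks, including the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), and the Flexible Flow Shop Problem (FFSP), show that the method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.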

Abstract (original)

Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into the fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.
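
The idea of turning rewards into pairwise preferences can be illustrated with a minimal sketch. It assumes a Bradley-Terry comparison model and treats `alpha * log_prob` as the implicit reward obtained from the policy reparameterization; the abstract names neither choice, and `preference_loss`, `costs`, `log_probs`, and `alpha` are hypothetical names used only for illustration, not the authors' implementation.

```python
import math
from itertools import combinations


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(costs, log_probs, alpha=1.0):
    """Average pairwise Bradley-Terry loss over sampled solutions.

    costs[i]     -- objective value of sample i (lower is better, e.g. tour length)
    log_probs[i] -- log-probability the policy assigns to sample i
    alpha        -- hypothetical scale standing in for the entropy coefficient
    """
    losses = []
    for i, j in combinations(range(len(costs)), 2):
        if costs[i] == costs[j]:
            continue  # ties carry no preference signal
        win, lose = (i, j) if costs[i] < costs[j] else (j, i)
        # Implicit reward difference (assumed form): alpha * (log pi(win) - log pi(lose)).
        # Any per-instance normalizing constant cancels in this difference.
        margin = alpha * (log_probs[win] - log_probs[lose])
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch the loss depends only on which sample has the lower cost, not on how large the gap is, so a pair of near-optimal solutions that differ by a tiny amount still yields a full comparison signal. Minimizing it raises the policy's log-probability of the preferred sample relative to the other one.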

arXiv information

Authors	Mingjun Pan, Guanquan Lin, You-Wei Luo, Bin Zhu, Zhien Dai, Lijun Sun, Chun Yuan
Published	2025-05-13 16:47:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
