Loss Functions and Operators Generated by f-Divergences

Summary

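The logistic loss (also known as the cross-entropy loss) is one of the most popular loss functions used for multiclass classification.
It is also the loss function of choice for next-token prediction in language modeling.
It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator.
In this work, we propose constructing new convex loss functions based on $f$-divergences.
Our loss functions generalize the logistic loss in two directions: i) replacing the KL divergence with $f$-divergences, and ii) allowing non-uniform reference measures.
We instantiate the framework for numerous $f$-divergences, recovering existing losses and creating new ones.
By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator, which we dub the $f$-softargmax.
We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence.
On the empirical side, one goal of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in language model settings, including pre-training, post-training (SFT), and distillation.
We show that the loss function generated by the $\alpha$-divergence (which is equivalent to the Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha = 1.5$ performs well across several tasks.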

Summary (original)

The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback–Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator, that we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.
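
To make the idea of a bisection-computed operator concrete, here is a minimal sketch for one well-known special case: the Tsallis $\alpha$-entropy with a uniform reference and $\alpha > 1$, where the output has the form $p_i = [(\alpha-1)\theta_i - \tau]_+^{1/(\alpha-1)}$ and the scalar $\tau$ is chosen so that the entries sum to one. This is not the paper's general algorithm for arbitrary $f$-divergences and reference measures; the function names `softargmax` and `tsallis_softargmax`, the bracketing interval, and the iteration count are assumptions made for illustration.

```python
import math


def softargmax(theta):
    """Classical softargmax (the KL case): exp(theta_i) / sum_j exp(theta_j)."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def tsallis_softargmax(theta, alpha=1.5, n_iter=60):
    """Illustrative bisection for the Tsallis alpha-entropy case (alpha > 1).

    Solves for tau such that sum_i max((alpha-1)*theta_i - tau, 0)**(1/(alpha-1)) = 1.
    Hypothetical helper, not the paper's implementation.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1")
    scaled = [(alpha - 1.0) * t for t in theta]
    exponent = 1.0 / (alpha - 1.0)

    def probs(tau):
        return [max(s - tau, 0.0) ** exponent for s in scaled]

    top = max(scaled)
    lo = top - 1.0                       # here the largest entry alone equals 1, so the sum is >= 1
    hi = top - len(theta) ** (1.0 - alpha)  # here every entry is <= 1/d, so the sum is <= 1
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:       # the sum is non-increasing in tau
            lo = mid
        else:
            hi = mid
    p = probs(lo)
    total = sum(p)                       # remove the residual bisection error
    return [x / total for x in p]


if __name__ == "__main__":
    scores = [2.0, 1.0, -1.0]
    print(softargmax(scores))            # dense: every entry is strictly positive
    print(tsallis_softargmax(scores))    # can be sparse: low scores may get exactly 0
```

Each evaluation of the sum inside the loop is an independent elementwise operation followed by a reduction, which is what makes this kind of bisection straightforward to vectorize or run in parallel across a batch.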

arXiv information

Authors	Vincent Roulet, Tianlin Liu, Nino Vieillard, Michael E. Sander, Mathieu Blondel
Published	2025-01-30 18:06:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
