Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

Summary

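Ensuring the trustworthiness of large language models (LLMs) is crucial.
Most studies focus on fully pre-trained LLMs to better understand and improve their trustworthiness.
In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness.
We begin by applying linear probing to LLMs.
The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}.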
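As a rough illustration of what linear probing involves, the sketch below fits a logistic-regression probe on synthetic "activations" and measures its accuracy on held-out samples. Everything here is an assumption made for illustration: the function names (`train_linear_probe`, `probe_accuracy`, `make_toy_activations`), the toy Gaussian data, and the hyperparameters are hypothetical and are not taken from the paper, which probes real LLM checkpoint representations.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe (weights and bias) by gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of samples whose label the linear probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def make_toy_activations(n_per_class=100, dim=4, seed=0):
    """Hypothetical stand-in for hidden states: two Gaussian clusters."""
    rng = random.Random(seed)
    feats, labs = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            feats.append([rng.gauss(center, 0.5) for _ in range(dim)])
            labs.append(label)
    return feats, labs


train_x, train_y = make_toy_activations(seed=0)
held_out_x, held_out_y = make_toy_activations(seed=1)
w, b = train_linear_probe(train_x, train_y)
print(f"held-out probe accuracy: {probe_accuracy(w, b, held_out_x, held_out_y):.2f}")
```

In a real setting the feature vectors would be hidden states taken from one layer of one checkpoint, and the labels would mark the two sides of a trustworthiness concept (for example, toxic versus non-toxic inputs). Repeating the fit per checkpoint would give an accuracy curve over pre-training.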
Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness.
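One common way to build a steering vector, shown here only as an assumed construction, is to take the difference between the mean activation on "desired" examples and the mean activation on "undesired" examples, then add a scaled copy of that vector to a hidden state at inference time. The abstract does not specify how the authors extract or apply their vectors, so the helper names and the difference-of-means recipe below are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def extract_steering_vector(positive_acts, negative_acts):
    """Difference of class means (an assumed recipe, not necessarily the paper's)."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, steering_vector, strength=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + strength * s for h, s in zip(hidden_state, steering_vector)]


# Toy activations standing in for hidden states from a pre-training checkpoint.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 0.0], [2.0, 2.0]]
direction = extract_steering_vector(positive, negative)
steered = apply_steering([0.0, 0.0], direction, strength=0.5)
```

The `strength` parameter is the main design knob in this sketch: larger values push the representation further toward the "desired" cluster, at the risk of degrading the model's general behavior.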
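Finally, inspired by the observation of~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training.
We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}.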
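To make the quantity being tracked concrete, the sketch below computes a plug-in estimate of mutual information between two discrete variables from paired samples. It is a toy under stated assumptions: real hidden states are continuous and high-dimensional, the estimator used by the authors is not given in this abstract, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# A representation that fully determines the label carries 1 bit about it;
# an independent one carries 0 bits.
labels = [0, 0, 1, 1]
informative_codes = [0, 0, 1, 1]
independent_codes = [0, 1, 0, 1]
mi_informative = mutual_information(informative_codes, labels)
mi_independent = mutual_information(independent_codes, labels)
```

Evaluating such an estimate at successive checkpoints, once against the inputs and once against the labels, is one way a two-phase curve could become visible, although the binning or estimation scheme needed for real activations is beyond this sketch.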
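This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.
The code will be made publicly available at \url{https://github.com/ChnQ/TracingLLM}.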

Summary (original)

Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs’ trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs’ trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM’s pre-training checkpoints to enhance the LLM’s trustworthiness. Finally, inspired by the observation of~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.

arXiv information

Authors	Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao
Published	2024-02-29 18:55:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
