Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

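Summary

Large language models (LLMs) are now ubiquitous, so it is important to understand their risks and limitations.
Smaller LLMs can be deployed where compute resources are constrained, such as on edge devices, but they differ in their propensity to generate harmful output.
Mitigating LLM harm typically relies on annotating the harmfulness of LLM output, and such annotations are expensive to collect from humans.
This work studies two questions. How do smaller LLMs rank with respect to generating harmful content?
How well can larger LLMs annotate harmfulness?
We prompt three small LLMs to elicit various types of harmful content, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs.
We then evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses.
We find that the smaller models differ in harmfulness.
We also find that the large LLMs show low to moderate agreement with humans.
These findings underline the need for further work on harm mitigation in LLMs.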

Abstract (original)

Large language models (LLMs) have become ubiquitous, thus it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as edge devices, but with different propensity to generate harmful output. Mitigation of LLM harm typically depends on annotating the harmfulness of LLM output, which is expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans. These findings underline the need for further work on harm mitigation in LLMs.
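
The abstract does not say which statistic is used to measure agreement between the large LLMs and the human annotators. As a purely illustrative sketch, one common way to compare two rankings of the same items is Kendall's tau, shown here in its simplest form (tau-a, no ties). The function name, the model names, and the rank values are all hypothetical placeholders, not data or methods from the paper.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (assumes no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings for one prompt (1 = most harmful); not taken from the paper.
human = {"model_a": 1, "model_b": 2, "model_c": 3}
llm_judge = {"model_a": 2, "model_b": 1, "model_c": 3}

tau = kendall_tau(human, llm_judge)
print(f"Kendall's tau: {tau:.3f}")
```

A value of 1 means the two rankings are identical, -1 means one is the exact reverse of the other, and values near 0 indicate little agreement. With only three items per prompt the statistic is coarse, so in practice it would be averaged over many prompts.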

arXiv information

Authors	Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, Rebecca J. Passonneau
Published	2025-04-21 17:30:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
