LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study

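Summary

As NLP models grow more complex, understanding their decisions becomes more important.
Counterfactuals (CFs), which flip a model's prediction through minimal changes to the input, offer one way to explain these models.
Large language models (LLMs) perform remarkably well on NLP tasks, but how effectively they generate high-quality CFs remains uncertain.
This work addresses that gap by investigating how well LLMs generate CFs for two NLU tasks.
We conduct a comprehensive comparison of several common LLMs and evaluate their CFs, assessing both intrinsic metrics and the impact of these CFs on data augmentation.
We also analyze the differences between human-written and LLM-generated CFs, providing insights for future research directions.
Our results show that LLMs generate fluent CFs but struggle to keep the induced changes minimal.
Generating CFs for sentiment analysis (SA) is less challenging than for NLI, where LLMs show weaknesses in generating CFs that flip the original label.
This is also reflected in data augmentation performance, where we observe a large gap between augmenting with human CFs and augmenting with LLM CFs.
Furthermore, we evaluate the ability of LLMs to assess CFs in a mislabelled data setting and show that they have a strong bias towards agreeing with the provided labels.
GPT4 is more robust against this bias, and its scores correlate well with automatic metrics.
Our findings reveal several limitations and point to potential directions for future work.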
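
To make the two properties mentioned in the summary concrete (a CF should change the input as little as possible and should flip the label), here is a minimal Python sketch. It is an illustration only: the summary does not say which intrinsic metrics the paper uses, so the token-level edit distance, the `flip_rate` helper, and the keyword-based `toy_classifier` are all assumptions introduced for this example, not the authors' method.

```python
from typing import Callable, List, Tuple


def token_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(original: str, counterfactual: str) -> float:
    """Edit distance divided by the longer token count (assumed proxy for minimality)."""
    a, b = original.split(), counterfactual.split()
    longest = max(len(a), len(b))
    return token_edit_distance(a, b) / longest if longest else 0.0


def flip_rate(cfs: List[Tuple[str, str]], classify: Callable[[str], str]) -> float:
    """Fraction of (counterfactual, target_label) pairs the classifier labels as the target."""
    if not cfs:
        return 0.0
    hits = sum(1 for text, target in cfs if classify(text) == target)
    return hits / len(cfs)


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a real sentiment model: keyword lookup only."""
    negative_words = {"terrible", "bad", "awful"}
    return "negative" if negative_words & set(text.lower().split()) else "positive"


if __name__ == "__main__":
    original = "the movie was great"
    cf = "the movie was terrible"
    print(normalized_distance(original, cf))
    print(flip_rate([(cf, "negative"), ("the movie was fine", "negative")], toy_classifier))
```

In this sketch a lower normalized distance means a more minimal edit, and a higher flip rate means more CFs reach the intended label. In practice the keyword classifier would be replaced by a trained model.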

Summary (original)

As NLP models become more complex, understanding their decisions becomes more crucial. Counterfactuals (CFs), where minimal changes to inputs flip a model’s prediction, offer a way to explain these models. While Large Language Models (LLMs) have shown remarkable performance in NLP tasks, their efficacy in generating high-quality CFs remains uncertain. This work fills this gap by investigating how well LLMs generate CFs for two NLU tasks. We conduct a comprehensive comparison of several common LLMs, and evaluate their CFs, assessing both intrinsic metrics, and the impact of these CFs on data augmentation. Moreover, we analyze differences between human and LLM-generated CFs, providing insights for future research directions. Our results show that LLMs generate fluent CFs, but struggle to keep the induced changes minimal. Generating CFs for Sentiment Analysis (SA) is less challenging than NLI where LLMs show weaknesses in generating CFs that flip the original label. This also reflects on the data augmentation performance, where we observe a large gap between augmenting with human and LLMs CFs. Furthermore, we evaluate LLMs’ ability to assess CFs in a mislabelled data setting, and show that they have a strong bias towards agreeing with the provided labels. GPT4 is more robust against this bias and its scores correlate well with automatic metrics. Our findings reveal several limitations and point to potential future work directions.

arXiv information

Authors	Van Bach Nguyen, Paul Youssef, Christin Seifert, Jörg Schlötterer
Published	2024-11-12 11:49:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
