Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks

Abstract

Many Natural Language Generation (NLG) tasks aim to generate a single output text given an input prompt. Other settings, such as Synthetic Traffic Generation (STG), require generating multiple texts. This generation task is crucial for training and evaluating QA systems and conversational agents, where the goal is to generate multiple questions or utterances that resemble the linguistic variability of real users. In this paper, we show that common NLG metrics, like BLEU, are not suitable for evaluating STG. We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts. We validate our metrics with an automatic procedure that checks whether they capture different types of quality issues in generated data; we also run human annotations to verify the correlation with human judgements. Experiments on three tasks, i.e., Shopping Utterance Generation, Product Question Generation and Query Auto Completion, demonstrate that our metrics are effective for evaluating STG tasks, and improve the agreement with human judgement by up to 20% with respect to common NLG metrics. We believe these findings can pave the way towards better solutions for estimating the representativeness of synthetic text data.
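The abstract does not spell out the proposed metrics, so the following is only a minimal sketch of the general idea of comparing a set of generated texts to a set of real user texts at the distribution level. It assumes whitespace tokenization and uses Jensen-Shannon divergence over unigram frequencies; both choices, along with the function names and the toy utterances, are illustrative assumptions and not the authors' method.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Relative frequency of each lowercased whitespace token across a set of texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no tokens in common.
    """
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a):
        # KL(A || M); tokens with zero probability in A contribute nothing.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Toy data (hypothetical): real user utterances and two generated sets.
real = ["buy running shoes", "order red shoes", "find cheap running shoes"]
collapsed = ["buy running shoes"] * 3  # fluent, but no variability
varied = ["buy red shoes", "order running shoes", "find cheap shoes"]

real_dist = unigram_distribution(real)
score_collapsed = js_divergence(real_dist, unigram_distribution(collapsed))
score_varied = js_divergence(real_dist, unigram_distribution(varied))

# Lower divergence means the generated set is closer to the real distribution.
print(f"collapsed: {score_collapsed:.3f}  varied: {score_varied:.3f}")
```

In this toy setup, every utterance in the collapsed set is an exact copy of a real utterance, so a per-sample overlap score would rate each one highly, yet the set as a whole misses much of the real vocabulary. A set-level comparison like the one sketched here penalizes that lack of variability.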

arXiv information

Authors	Simone Filice, Jason Ingyu Choi, Giuseppe Castellucci, Eugene Agichtein, Oleg Rokhlenko
Published	2023-11-21 11:26:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
