Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration

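Summary

In real-world machine learning projects, acquiring labelled training data that meets both quantity and quality requirements remains costly.
Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy.
However, privacy and cost concerns prevent the widespread use of GPT-4.
In this work, the authors explore how to leverage open-source models effectively for automatic labelling.
They identify label schema integration as a promising technique, but find that naively using label descriptions for classification leads to poor performance on high-cardinality tasks, that is, tasks with many candidate labels.
To address this, they propose Retrieval Augmented Classification (RAC), in which the LLM performs inference for one label at a time using the corresponding label schema.
RAC starts with the most related label and iterates until the LLM chooses a label.
They show that this method, which dynamically integrates label descriptions, improves performance on labelling tasks.
They further show that by focusing only on the most promising labels, RAC can trade off label quality against coverage, a property they leverage to automatically label their internal datasets.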
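
The one-label-at-a-time loop described in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token-overlap ranking stands in for whatever retriever the paper uses, `toy_llm` is a keyword stub standing in for an open-source LLM call, and the names `rank_labels`, `rac_classify`, `max_labels`, `SCHEMA`, and `KEYWORDS` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer (a stand-in for a real text encoder)."""
    return set(text.lower().split())


def rank_labels(text: str, schema: Dict[str, str]) -> List[str]:
    """Order labels by Jaccard overlap between the text and each label description.

    Assumption: a real system would likely use an embedding-based retriever.
    """
    words = tokenize(text)

    def score(label: str) -> float:
        desc = tokenize(schema[label])
        union = words | desc
        return len(words & desc) / len(union) if union else 0.0

    # sorted() is stable, so tied labels keep their schema order.
    return sorted(schema, key=score, reverse=True)


def rac_classify(
    text: str,
    schema: Dict[str, str],
    accepts: Callable[[str, str, str], bool],
    max_labels: Optional[int] = None,
) -> Optional[str]:
    """Ask about one label at a time, most related first, until one is accepted.

    max_labels caps how many top-ranked labels are tried. Returning None means
    the item is left unlabelled, which lowers coverage but avoids guessing.
    """
    ranked = rank_labels(text, schema)
    if max_labels is not None:
        ranked = ranked[:max_labels]
    for label in ranked:
        if accepts(text, label, schema[label]):
            return label
    return None


# Hypothetical label schema: label name -> description.
SCHEMA = {
    "billing": "questions about invoices payments and refunds",
    "shipping": "questions about delivery tracking and shipping delays",
    "account": "questions about login password and account settings",
}

# Hypothetical keyword judge that plays the role of the LLM's yes/no answer.
KEYWORDS = {
    "billing": {"refund", "invoice"},
    "shipping": {"delivery", "tracking"},
    "account": {"password", "login"},
}


def toy_llm(text: str, label: str, description: str) -> bool:
    return bool(tokenize(text) & KEYWORDS[label])


if __name__ == "__main__":
    print(rac_classify("my delivery tracking shows delays", SCHEMA, toy_llm))  # shipping
    print(rac_classify("refund for shipping delays", SCHEMA, toy_llm))  # billing
    print(rac_classify("refund for shipping delays", SCHEMA, toy_llm, max_labels=1))  # None
```

In the last call the top-ranked label is rejected and the budget of one label is exhausted, so the item stays unlabelled. Under these assumptions, a small `max_labels` labels fewer items but only when a highly ranked label is accepted, which is one way to read the quality and coverage trade-off mentioned in the summary.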

Summary (original)

Acquiring labelled training data remains a costly task in real world machine learning projects to meet quantity and quality requirements. Recently Large Language Models (LLMs), notably GPT-4, have shown great promises in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating label schema as a promising technology but found that naively using the label description for classification leads to poor performance on high cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC) for which LLM performs inferences for one label at a time using corresponding label schema; we start with the most related label and iterates until a label is chosen by the LLM. We show that our method, which dynamically integrates label description, leads to performance improvements in labelling tasks. We further show that by focusing only on the most promising labels, RAC can trade off between label quality and coverage – a property we leverage to automatically label our internal datasets.

arXiv information

Authors	Thomas Walshe, Sae Young Moon, Chunyang Xiao, Yawwani Gunawardana, Fran Silavong
Published	2025-01-21 18:06:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
