ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models

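Abstract

Assessing the quality of outputs produced by generative models, such as large language models (LLMs) and vision language models (VLMs), is a notable challenge.
Traditional evaluation methods typically rely either on human assessment, which is resource-intensive, or on automatic metrics, which often correlate poorly with human judgment.
Another common approach is to use deep learning systems, which consume substantial compute and time and also require extensive training data.
This study introduces ReFeR, a tuning-free framework designed to evaluate generative outputs, including both text and images, by leveraging a 2-level hierarchy of LLMs and VLMs themselves.
The authors rigorously evaluate ReFeR across four diverse evaluation tasks.
The framework improves the accuracy of these evaluations, surpassing previous benchmarks, and also generates constructive feedback.
Interestingly, the framework is also applicable to reasoning tasks.
Experiments on four reasoning tasks demonstrate the framework's superior collective reasoning abilities.
Two variants of the framework are presented: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, which offers a more cost-effective solution.
ReFeR-Lite is $\sim7.7\times$ more efficient while being comparably accurate to ReFeR-Turbo.
The code, data, and PIP package are publicly available.
See the PIP URL https://pypi.org/project/refer-agents/ and the Git URL https://github.com/yaswanth-iitkgp/ReFeR_Code.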
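
The idea of a 2-level hierarchy of evaluator models can be illustrated with a minimal Python sketch. The abstract states only that ReFeR uses two levels of LLMs and VLMs; how the levels interact here (several first-level models review independently, then one second-level model sees their reviews and gives the final verdict) is an assumption made for illustration. The names `Review`, `evaluate`, `fixed_reviewer`, and `averaging_aggregator` are hypothetical and are not the API of the `refer-agents` package. The models are stand-in functions rather than real LLM calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Review:
    score: float   # numeric quality rating
    comment: str   # free-text feedback


# Hypothetical interfaces: a first-level model maps an output to a Review;
# a second-level model maps the output plus all first-level reviews to a Review.
FirstLevel = Callable[[str], Review]
SecondLevel = Callable[[str, List[Review]], Review]


def evaluate(output: str, first_level: List[FirstLevel],
             second_level: SecondLevel) -> Review:
    """Assumed two-level flow: independent reviews, then one final verdict."""
    reviews = [model(output) for model in first_level]
    return second_level(output, reviews)


def fixed_reviewer(score: float, comment: str) -> FirstLevel:
    """Stand-in for a first-level LLM/VLM call; always returns the same review."""
    return lambda output: Review(score, comment)


def averaging_aggregator(output: str, reviews: List[Review]) -> Review:
    """Stand-in for the second-level model: averages scores, joins comments."""
    return Review(mean(r.score for r in reviews),
                  " | ".join(r.comment for r in reviews))


result = evaluate(
    "A generated summary to be judged.",
    [fixed_reviewer(3.0, "misses one key point"),
     fixed_reviewer(5.0, "fluent and faithful")],
    averaging_aggregator,
)
print(result.score, "-", result.comment)
```

In a real setting, each stand-in would be replaced by a prompt to an actual model, and the second-level model would reason over the first-level feedback rather than simply averaging it.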

Abstract (original)

Assessing the quality of outputs generated by generative models, such as large language models and vision language models, presents notable challenges. Traditional methods for evaluation typically rely on either human assessments, which are resource-intensive, or automatic metrics that often show a low correlation with human judgment. Another common approach is to use deep learning systems, which not only consume a substantial amount of compute and time but also require extensive training data. In this study, we introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images, by leveraging a 2-level hierarchy of LLMs and VLMs themselves. We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks. The framework not only improves the accuracy of these evaluations, surpassing previous benchmarks but also generates constructive feedback. Interestingly, the framework is also applicable to reasoning tasks. Experiments on four reasoning tasks demonstrate superior collective reasoning abilities of the framework. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, offering a more cost-effective solution. ReFeR-Lite is $\sim7.7\times$ more efficient while being comparably accurate to ReFeR-Turbo. We make code, data and PIP package publicly available. See this PIP URL https://pypi.org/project/refer-agents/ and this Git URL https://github.com/yaswanth-iitkgp/ReFeR_Code .

arXiv information

Authors	Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal
Published	2024-10-09 17:51:44+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
