Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning

Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM--as--a--Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective, yielding reliable ...

Read Original Article →

Source

http://arxiv.org/abs/2606.19057v1