Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental resu...

Source: http://arxiv.org/abs/2605.13801v1