LLM Benchmark Datasets Should Be Contamination-Resistant

Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., contaminated, which diminishes their value as reliable measures of model generalization...

Source

http://arxiv.org/abs/2605.19999v1