Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structur...

Read Original Article →

Source

http://arxiv.org/abs/2605.14517v1