Gradient-controlled decoding: A safety guardrail for LLMs with dual-anchor steering

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak & prompt injection detection such as, GradSafe, detects unsafe prompts with a single 'accept all' anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ('Sure') and refusal anchor token ('Sorry') tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ('Sorry, I can't . . . ') before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15-20 ms latency on an average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8×7B, and Qwen-2-7B, and requires only 20 demonstration templates.

Read Original Article →

Source

https://www.amazon.science/publications/gradient-controlled-decoding-a-safety-guardrail-for-llms-with-dual-anchor-steering