

Under-refusal

When a model complies with a request it should have refused: someone asks for instructions to do something dangerous, and the model helpfully provides them.

Under-refusal is the dangerous counterpart to over-refusal. The model helps with something it should decline: generating harmful instructions, assisting with social engineering, writing exploitable code, producing prohibited content. Safety training is imperfect, and adversarial users (through jailbreaks and prompt injection) actively search for gaps. Making a model less restrictive to reduce over-refusal inherently increases the risk of under-refusal, and vice versa.

Builder example

A single under-refusal in a high-stakes domain can cause serious harm and legal liability. If your product connects a language model to tools that take real-world actions (sending emails, executing code, modifying data), an under-refusal becomes an action. The cost of a missed refusal scales with the power of the tools the model controls.
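One common mitigation for the tool-connected case is to treat the model's proposed actions as untrusted and gate them behind an independent policy check, so that an under-refusal produces a blocked call rather than a real-world side effect. The sketch below illustrates the idea; every name in it (`ToolCall`, `blocklist_check`, `run_tool`, the tool names) is hypothetical, and a real deployment would replace the toy blocklist with a trained classifier or a proper policy engine.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    """A tool invocation proposed by the model (hypothetical structure)."""
    name: str
    args: dict

def blocklist_check(call: ToolCall) -> bool:
    """Toy policy: allow a call only if its tool is not on a blocklist.
    Stands in for a real safety classifier or policy engine."""
    dangerous = {"execute_shell", "send_bulk_email"}
    return call.name not in dangerous

def run_tool(call: ToolCall,
             policy: Callable[[ToolCall], bool],
             tools: dict) -> str:
    # Every model-proposed action passes through the policy *before*
    # any side effect occurs; an under-refusal upstream is caught here.
    if not policy(call):
        return f"refused: {call.name} blocked by safety policy"
    return tools[call.name](**call.args)

# Illustrative tool registry with one harmless tool.
tools = {"get_weather": lambda city: f"sunny in {city}"}

print(run_tool(ToolCall("get_weather", {"city": "Oslo"}), blocklist_check, tools))
print(run_tool(ToolCall("execute_shell", {"cmd": "rm -rf /"}), blocklist_check, tools))
```

The design point is the separation of concerns: the model decides *what* to do, but a component outside the model decides *whether* it may happen, which limits the blast radius of any single missed refusal.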

Common confusion: A model can sound helpful, polite, and thorough while producing something genuinely dangerous. Helpfulness and safety are independent dimensions; excelling at one does not guarantee the other.