Goal hijacking

An attack where prompt injection redirects the model away from its intended task and toward the attacker's objective.

Goal hijacking is more than breaking one rule. The attacker redirects the model's entire purpose. Picture a travel-booking assistant that reads customer emails to help plan trips. An attacker plants hidden instructions in an email telling the model to stop helping with travel and quietly collect payment details instead, sending them to an external server. The model keeps responding politely, but every action now serves the attacker's goal. The user sees a helpful assistant; the attacker sees a data-collection tool.

Builder example

In agent systems with multi-step workflows, goal hijacking is especially dangerous because the model may execute many actions before anyone notices the objective has changed. A hijacked purchasing agent could approve transactions. A hijacked support agent could share confidential information. In each case, the surface-level conversation looks normal.

Common confusion: Goal hijacking is a specific outcome of prompt injection. Prompt injection is the mechanism (sneaking instructions into content the model reads). Goal hijacking is what happens when those injected instructions redirect the model's entire mission.