Adversarial Prompting: AI’s Security Guard

Published on

April 23, 2025

Author

Authors

Emily Dix

LLM Engagement Manager

Appen

Madison Van Doren

Content Marketing Manager

Appen

AI models are evolving fast—getting more helpful, more fluent, and more integrated into our daily lives and business operations. But as their capabilities grow, so do the risks. One of the most pressing challenges in maintaining safe and trustworthy AI is adversarial prompting: a subtle, often creative way of manipulating AI systems into behaving badly. From fictional framing to clever persuasion, attackers are finding new ways to coax large language models (LLMs) into producing harmful or inappropriate content. In this post, we’ll break down what adversarial prompting is, how it works, and what your organisation can do to build more resilient AI systems.

Defining Adversarial Prompting

At its core, adversarial prompting is the practice of crafting inputs that intentionally bypass or undermine AI safety mechanisms. These aren’t your average, clumsy jailbreak attempts. Today’s adversarial prompts are often sophisticated, subtle, and well-researched, using psychological and linguistic tactics to trick models into violating their alignment rules.

Unlike classic hacking, this isn't about exploiting code vulnerabilities. It's about exploiting language—the same interface that makes LLMs so powerful. By carefully choosing words, tone, or context, users can make a model produce harmful, biased, or restricted content—even when it’s explicitly trained not to.

Examples of Prompt Injection Attacks

Adversarial attacks on AI can take many forms; each tailored to bypass safety filters in different ways. To test the efficacy of different techniques, Appen developed a novel adversarial prompting dataset and benchmarked the performance of leading LLMs across a range of harm categories. Our research revealed four leading strategies:

1. Virtualization: Fictional Scenario Framing

Attackers wrap harmful requests in hypotheticals or creative writing scenarios. For instance, asking the model to “help write a scene where a character voices a hateful belief” often produces results that would be blocked if the request were direct. Our tests show that virtualization can lead to harm scores 30–50% higher than straightforward prompts.

2. Sidestepping: Indirect Prompting Strategies

This method involves vague, suggestive phrasing or implied context that skirts around explicit keywords. For example, prompts might ask for “opinions” or “historical examples” of controversial views, encouraging the model to generate harmful content without making an overt request. Sidestepped prompts resulted in 20–40% higher average harm scores in our evaluations.

3. Filter Evasion & Injection

Classic tactics like asking the model to “ignore all previous instructions” or translate harmful content into code or other languages can still work—especially when disguised as formatting or transformation tasks. One tested prompt asked the model to replace words in a passage with offensive terms under the guise of a “translation exercise”—a direct evasion of safety filters.

4. Persuasion and Persistence

Combining techniques like urgency, or moral appeals, attackers can wear down a model’s refusals over multiple interactions (Zeng et al., 2024). This is particularly effective when using tactics such as:

Authority – Pretending to consult the model as a trusted expert.
Loyalty – Framing the interaction as a long-standing relationship.
Logic – Arguing that the harmful response is the only rational or helpful course.
Misrepresentation – Impersonating someone in distress to elicit a response. These “humanised” approaches—especially when persistent—significantly increase the risk of harmful completions.

Why training data matters for LLM safety

LLM training data is the foundation of every model—and its quality directly impacts safety and alignment. Models trained on unfiltered or biased data are more susceptible to adversarial prompting and more likely to produce harmful outputs under pressure.

Safety-aligned, high-quality datasets, including adversarial examples, are essential to build models that can recognise and resist manipulative inputs. From instruction tuning to reinforcement learning with human feedback (RLHF), robust data curation is key to mitigating risks and ensuring LLMs behave reliably across diverse contexts.

Impacts on AI Performance and Safety

Adversarial prompts can erode trust in LLMs, especially in high-stakes environments like healthcare, finance, or customer service. When models fall for sidestepping or persuasive framing, they may:

Output hate speech or misinformation.
Provide unsafe instructions.
Reinforce stereotypes or biases.
Fail to flag unethical content.

Even occasional slip-ups can lead to regulatory risk, reputational damage, and real-world harm, and because many of these prompts exploit nuance and ambiguity, they’re hard to detect with standard moderation tools.

Red Teaming and Defence Strategies

Proactive defence starts with LLM red teaming—structured testing using adversarial techniques to uncover vulnerabilities. This should include:

Scenario-based testing (e.g. fictional framing, translation traps).
Psychological tactics (authority, urgency, emotional framing).
Indirect or even direct requests designed to probe moderation blind spots.

Beyond testing, models need layered defences, including:

Strong instruction-following training and refusal behaviour.
Context-aware moderation that goes beyond keywords.
Logging and human review of flagged interactions.
Continual updates based on the latest adversarial research.

Building Robust LLM Systems

At Appen, we believe robustness isn’t just about the model—it’s about the data, too. Training on high-quality, safety-aligned data and incorporating adversarial examples early in the development cycle helps models learn what not to say under complex conditions.

Moreover, reinforcement learning from human feedback (RLHF), instruction tuning, and continuous safety evaluation are essential for keeping models aligned—even in the face of novel attack strategies.

Whether you're deploying a customer-facing chatbot or fine-tuning your own foundation model, it’s critical to treat prompt manipulation not as a niche concern but as a core risk to mitigate.

Secure your AI systems against prompt threats—get in touch with Appen's LLM experts today.