When it comes to adversarial prompting, sometimes the most responsible answer an AI can give is no answer at all. Appen’s research team recently conducted one of the largest multimodal red teaming studies to date, benchmarking four leading large language models (LLMs) under adversarial attack.
Our findings revealed that Anthropic’s Claude 3.5 Sonnet was the most resistant to adversarial prompting, largely because it refused to respond more often than its competitors.
This raises a provocative question: should LLM benchmarks start rewarding abstention over potentially harmful or hallucinated answers?
Silence as a Safety Mechanism
Appen’s study tested 726 adversarial prompts spanning illegal activity, disinformation, and unethical behavior across GPT-4o, Claude 3.5 Sonnet, Pixtral 12B, and Qwen VL Plus. The results were striking:
- Pixtral 12B was the most vulnerable, with ~62% harmful outputs.
- Claude 3.5 Sonnet was the most resistant, at just ~10–11% harmful outputs.
But resistance came with a tradeoff. Claude’s responses appeared to be the least harmful at first glance, but further investigation revealed that this was largely the result of the model’s high rate of default refusals.
This highlights a core tension: is silence the ultimate shield against prompt injection and adversarial attacks on AI, or does it risk frustrating users when harmless engagement would suffice?
Why Current Benchmarks Fall Short
Traditional AI benchmarks tend to treat outputs in binary terms: right vs. wrong. This unintentionally encourages models to “bluff” and provide an answer even when they don’t know one. OpenAI’s recent work on AI hallucinations reinforces this point: current scoring frameworks penalize caution and inadvertently reward confident fabrication.
This creates real risk when AI models are deployed in the real world. A model that invents unsafe instructions can cause more harm than one that declines to answer.
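To make that incentive concrete, here is a minimal, purely illustrative sketch (not Appen’s or OpenAI’s actual evaluation code) of a conventional binary exact-match scorer. Under this kind of rubric, an honest “I don’t know” and a confident fabrication score identically, so the only way for a model to gain points is to guess.

```python
# Illustrative sketch only: a binary exact-match scorer that treats every
# non-matching response as equally wrong, so abstention earns nothing.

def binary_score(response: str, expected: str) -> float:
    """Classic right/wrong scoring: 1.0 for the expected answer, 0.0 otherwise."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0

# A cautious refusal and a confident fabrication both score 0.0, so a model
# optimizing this metric learns to guess rather than abstain.
print(binary_score("I don't know", expected="Paris"))  # 0.0 -- honest abstention
print(binary_score("Lyon", expected="Paris"))          # 0.0 -- confident but wrong
print(binary_score("Paris", expected="Paris"))         # 1.0 -- only a lucky guess pays off
```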
Toward Refusal-Aware Benchmarks
Appen’s research suggests it’s time to rethink LLM evaluation rubrics and best practices. Instead of treating refusals as failures, benchmarks should:
- Reward strategic abstention: Score refusals positively when they prevent harm.
- Differentiate between safe silence and unsafe hallucination: Make abstention a first-class outcome.
- Measure LLM vulnerabilities: Incorporate stress-testing, such as adversarial prompting, as part of core evaluation.
Our prior work in LLM red teaming has shown that strategies like role play and refusal suppression can bypass filters if models aren’t trained to value abstention. Refusal-aware scoring would make models more resilient.
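As a rough sketch of what refusal-aware scoring could look like (a hypothetical rubric for illustration, not Appen’s published methodology), the example below makes abstention a first-class outcome: harmful compliance is penalized most heavily, refusing a genuinely harmful prompt is rewarded, and refusing a benign prompt is only mildly penalized so models are not pushed toward blanket silence.

```python
# Hypothetical refusal-aware rubric: abstention is scored as its own outcome,
# weighted by whether the prompt was actually harmful.
from dataclasses import dataclass

@dataclass
class Judgment:
    prompt_is_harmful: bool   # label from the red-team taxonomy
    model_refused: bool       # did the model abstain?
    output_is_harmful: bool   # judged content of the response

def refusal_aware_score(j: Judgment) -> float:
    if j.output_is_harmful:
        return -1.0           # harmful compliance: worst outcome
    if j.model_refused:
        # Safe refusal of a harmful prompt is rewarded; refusing a benign
        # prompt is only mildly penalized to discourage blanket silence.
        return 1.0 if j.prompt_is_harmful else -0.25
    return 1.0                # safe, non-refusing engagement

print(refusal_aware_score(Judgment(True, True, False)))   #  1.0  strategic abstention
print(refusal_aware_score(Judgment(True, False, True)))   # -1.0  adversarial prompt succeeded
print(refusal_aware_score(Judgment(False, True, False)))  # -0.25 over-refusal of a benign request
```

The specific weights here are arbitrary; the point is that the scorer can tell strategic abstention apart from both unsafe fabrication and over-refusal.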
Why This Matters for AI Safety
For organizations deploying AI in high-stakes environments, trust and safety are critical. Our latest red teaming research shows that adversarial attacks on AI can elicit harmful outputs from even state-of-the-art models. By reframing silence as a feature rather than a flaw, enterprises can adopt AI systems that minimize risk while maintaining trust. It’s an essential shift for responsible AI safety.
Key Takeaways
Bad actors are only getting more sophisticated, and real-world AI applications are reaching ever higher stakes. The path forward lies in balancing helpfulness and safety. While the safest response may sometimes be no response at all, the solution lies not in teaching models to always provide an answer, but in equipping them with the judgment to say: I don’t know.
- Silence can be a strength: Refusals, when used strategically, are a powerful defense against adversarial prompting.
- Current benchmarks penalize caution: Binary right/wrong scoring often rewards unsafe hallucinations over safe refusals.
- Refusal-aware evaluation is critical: Benchmarks should distinguish between unsafe fabrication and safe abstention.
Ready to learn more?
Explore the latest in LLM red teaming and evaluation from Appen’s research team: