We ask a lot of modern AI systems. We want them to be helpful, but not reckless; open yet discerning; fast without cutting ethical corners. Traditional binary evaluation (“safe” vs “unsafe”) can’t capture that nuance. It penalizes thoughtful refusal and treats uncertainty as failure, when in practice, the right answer is often: “not like this, and here’s why.”
Our latest LLM evaluation paradigm embraces that nuance with tricategorical reasoning: a scoring scheme that rewards responsible restraint and makes ethical uncertainty measurable. It’s part of a broader human-in-the-loop approach to reliability and safety that we’ll be sharing at NeurIPS.
Why binary “safe/unsafe” misses the point
Binary frameworks collapse a spectrum of judgments into a single bit. But in real deployments, models should not only answer correctly—they should know when not to answer, and they should explain refusals in context. That distinction matters to product teams, policy leads, and red-teamers alike. This approach to AI safety is also consistent with how we already think about reliability as a product of both judgement and accuracy.
Tricategorical reasoning: 0 vs 0.5 vs 1
Appen’s research team recently conducted an extensive multimodal red teaming study examining how leading models respond to adversarial prompting attack strategies. We exposed each model to 726 adversarial prompts targeting illegal activity, disinformation, and unethical behavior, across both text-only and text–image inputs. Human annotators then rated nearly 3,000 model outputs for harmfulness, revealing vulnerabilities across even the most state-of-the-art models. This paper was accepted by workshops at AAAI 2026 & EurIPS 2025 (we hope to see you there!).
For our talk at NeurIPS, we recoded outputs from this research into a three-point ethical reasoning scale:
This schema separates ethical cognition (1) from mechanical safety (0.5) and harm (0). It rewards models that articulate why a request is unsafe rather than merely declining by rote, and it surfaces when a model chooses caution over risk even without an explicit rationale.
Grounding the metrics with human-in-the-loop
To validate this approach, we lean on inter-rater reliability (IRR) with ordinal-aware statistics. For background on measuring judgment consistently across raters, see our primer on Krippendorff’s Alpha and companion IRR methods.
In our proof-of-concept:
- Krippendorff’s Alpha (ordinal) ≈ 0.65 → moderate, appropriate for nuanced judgments
- Gwet’s AC1 (weighted) ≈ 0.67 → stable under skewed distributions
- Weighted Cohen’s κ (mean pairwise) ≈ 0.66 → consistent with the above
- ICC(2,k) ≈ 0.97 → excellent consistency once averaged across raters
Together, these indicate that humans can reliably recognize responsible restraint and distinguish it from both mechanical refusal and unsafe responses (key for scaling human-in-the-loop evaluation). For a deeper dive into alpha, data types, and distance metrics, see our IRR explainer.
What our early results show
Examining 47,408 annotated model responses, descriptive statistics illustrate meaningful separation:
- Mean tricategorical scores (ethics-as-reasoning capacity) varied by model, with some showing stronger ethical articulation (higher “1” rates) and others defaulting to safety without reasons (higher “0.5”).
- A simple restraint index, R₍restraint₎ = P(0.5) − P(0), quantifies “caution over harm.” For instance, models tuned for safety showed positive R₍restraint₎, while risk-prone models skewed negative.
- Multimodal vs. text-only analysis revealed modality effects. Some systems struggled to sustain ethical reasoning under visual prompts even when they performed well in text.
These patterns validate tricategorical reasoning as a sensitive instrument: it captures not just if a model is safe, but how it gets there—and how consistently.
How this reframes content safety and red teaming
Shifting from binary to tricategorical scoring changes what “good” looks like in content safety:
- Reward responsible restraint: Thoughtful refusals earn full credit (1), encouraging models to identify when refusal is the safest response.
- Treat unreasoned safety as signal: Default refusals (0.5) reveal where safety training is mechanical and where to invest in ethical grounding.
- Expose high-impact disagreements: Reliability metrics highlight where humans diverge—often the most ethically interesting regions for policy and model design.
This aligns with current research trends at ACL 2025: evaluation is moving past blunt pass/fail checks toward verified reasoning, multimodal robustness, and culturally aware alignment – all domains where nuanced scoring and human judgment matter.
Case studies: from benchmarking to red teaming
- Next-Gen Benchmarking with Human-AI Evaluation: We built finer-grained benchmarks that combine ordinal human scoring with reliability checks—an approach that maps cleanly onto tricategorical reasoning.
- Red Teaming Out-of-Scope Topics: For a safety-critical enterprise assistant, we stress-tested refusal behavior. Tricategorical scoring let us separate “won’t answer + explains why” from “won’t answer, full stop”—useful for tuning trust and UX.
Build evaluation pipelines that scale
Putting tricategorical reasoning into production doesn’t require reinventing your stack:
- Data: Blend adversarial prompts (jailbreaks, fictional framing, injection) with standard tasks to probe ethical boundaries. (See our guidance on red teaming and adversarial prompting.)
- Process: Use human-in-the-loop scoring with test questions/golden sets to continuously calibrate raters and surface instruction issues inside your annotation platform.
- Metrics: Report mean tricategorical score, R₍restraint₎, consistency (1−SD), and IRR (Alpha, AC1, κ, ICC). Pair these with modality- and model-level cuts to guide targeted safety fine-tuning.
Work with Appen
If you’re evolving beyond one-bit safety checks, we can help. Our human-in-the-loop pipelines and measurement frameworks turn ethical nuance into deployable metrics, so you can reward responsible restraint, not punish it. Speak with an expert to get started.



