Old Is New Again: How Rubrics and Fine-Tuning Work Together in LLM Evaluation

Published on July 22, 2025

As generative AI systems mature, two approaches are proving essential to shaping and assessing model quality: supervised fine-tuning (SFT) and rubric-based evaluation.

But how do these approaches differ?

SFT is a training method: it teaches models what kinds of responses to produce by exposing them to curated examples (often written by humans) that demonstrate ideal behaviour. Rubric-based evaluation, by contrast, is a measurement method: it assesses whether a fine-tuned model's outputs meet expectations by scoring them against structured criteria such as helpfulness, tone, factuality, or safety.

In short: SFT shapes the model. Rubrics score its performance. One teaches; the other judges.
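To make the distinction concrete, here is a minimal Python sketch with illustrative structures only (not Appen tooling or any lab's actual schema): an SFT example is a curated prompt/response pair destined for training, while a rubric turns quality criteria into scores.

```python
# Illustrative only: neither a real training pipeline nor a production rubric.

# SFT shapes the model: curated prompt/response pairs demonstrating ideal
# behaviour become training data for supervised fine-tuning.
sft_example = {
    "prompt": "Summarise these meeting notes in three bullet points.",
    "response": "- Q3 budget approved\n- Hiring freeze lifted\n- Next review on 12 Aug",
}

# Rubrics score performance: structured criteria an evaluator rates, say 1-5.
RUBRIC = {
    "helpfulness": "Does the response fully address the user's request?",
    "tone": "Is the tone appropriate for the audience and context?",
    "factuality": "Are all claims grounded in the source material?",
    "safety": "Is the response free of harmful content?",
}

def rubric_score(ratings: dict[str, int]) -> float:
    """Aggregate an evaluator's per-dimension ratings into a single score."""
    missing = set(RUBRIC) - set(ratings)
    if missing:
        raise ValueError(f"unrated dimensions: {missing}")
    return sum(ratings.values()) / len(ratings)

print(rubric_score({"helpfulness": 5, "tone": 4, "factuality": 5, "safety": 5}))  # 4.75
```

The same rubric can be applied before and after fine-tuning, which is what makes the two approaches complementary rather than competing.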

At Appen, we’ve spent decades helping leading AI companies operationalise human judgment. Our roots in search relevance – from early link ranking to complex, context-specific query matching – taught us how to scale subjective decisions with structure and consistency. Defining quality, calibrating reviewers, and applying nuanced rubrics remain essential skills for large language model (LLM) evaluation today. The core principles that underpinned high-quality search relevance for years are now being retooled to assess LLM outputs and operationalise human judgment at scale.

Search Relevance and Generative Rubrics: Two Sides of the Same Coin

Modern search evaluation has long moved past binary notions of “relevant” vs. “not.” Our work in this space has evolved to factor in dimensions like intent, context, tone, and trustworthiness. Evaluators are trained to apply nuanced rubrics tailored to different use cases—sound familiar? It’s exactly what leading labs are doing now to evaluate outputs from large language models.

Whether it’s rating a chatbot’s helpfulness, a summary’s factual grounding, or a generated response’s inclusivity, rubric-based evaluation mirrors what relevance raters have done for years: make context-rich, subjective judgments grounded in clear guidelines.

And increasingly, those judgments are going beyond post-hoc analysis to power live feedback loops for model training and alignment.

Case in Point: Cohere’s Preference-Based Fine-Tuning

One recent example is our collaboration with Cohere on their PANDA Plus initiative (Preference Annotation Data Acquisition Plus Supervised Fine-Tuning). As part of their mission to build secure, enterprise-ready LLMs, Cohere needed high-quality human feedback to fine-tune their Command model in real time, across both production and experimental settings.

Appen provided a vetted pool of expert annotators with LLM experience and deployed a custom real-time feedback tool that connected directly to Cohere’s model endpoint. Together, Cohere and Appen fine-tuned their model with curated examples and leveraged human evaluations to identify stronger outputs, rewrite weaker ones, and feed those back into the fine-tuning loop.

Annotators performed structured evaluations such as the following (see the sketch after this list):

  • A/B comparison of model responses
  • Instruction-based completion rewrites
  • Freeform feedback and rubric-based edits
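As a rough illustration of how such an evaluation might be captured, here is a sketch of an A/B preference record. The field names are hypothetical, not Cohere's or Appen's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    """One structured evaluation: an A/B comparison plus optional rewrite
    and freeform feedback. Hypothetical schema, for illustration only."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str              # "a", "b", or "tie"
    rewrite: str | None = None  # annotator's improved completion, if any
    feedback: str = ""          # freeform notes or rubric-based edit rationale

record = PreferenceRecord(
    prompt="Explain the refund policy to a frustrated customer.",
    response_a="Refunds take 5-7 business days once approved.",
    response_b="Per policy section 4.2, remittance is processed quarterly.",
    preferred="a",
    rewrite="Sorry for the trouble. Your refund will arrive within 5-7 business days.",
    feedback="B is policy-accurate but tone-deaf; A addresses the customer directly.",
)
```

Records like these can flow straight back into the fine-tuning loop: preferred and rewritten responses become new curated examples, closing the teach-then-judge cycle described above.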

By delivering subjective quality assessments at scale, Appen helped Cohere prioritise high-fidelity responses over throughput, emphasising thoughtful revisions and domain-aware judgement. In just 12 weeks, Appen contributors logged over 2,400 expert hours, helping feed PANDA Plus with structured preference data and targeted feedback that informed ongoing fine-tuning cycles.

Appen’s Edge: Scaling Subjectivity with Discipline

We’ve been trusted by the largest players in AI for over 25 years because of our ability to bring consistency and scale to inherently subjective judgments, operationalising nuance by:

  • Designing clear, robust rubrics that match user expectations
  • Calibrating evaluators with gold standards and performance feedback (see the sketch after this list)
  • Running real-time quality audits and blind overlap checks
  • Enabling tooling that supports live feedback and dynamic model evolution
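Two of these mechanisms, gold-standard calibration and blind overlap checks, reduce to simple agreement statistics. A minimal sketch, with made-up labels and raters:

```python
from itertools import combinations

def gold_accuracy(rater: dict[str, str], gold: dict[str, str]) -> float:
    """Share of gold-standard items where a rater matched the reference label."""
    shared = set(rater) & set(gold)
    return sum(rater[i] == gold[i] for i in shared) / len(shared)

def pairwise_agreement(raters: dict[str, dict[str, str]]) -> float:
    """Raw agreement rate across blind-overlap items seen by multiple raters."""
    agree = total = 0
    for a, b in combinations(raters.values(), 2):
        shared = set(a) & set(b)
        agree += sum(a[i] == b[i] for i in shared)
        total += len(shared)
    return agree / total if total else 0.0

gold = {"q1": "helpful", "q2": "unhelpful"}
raters = {
    "r1": {"q1": "helpful", "q2": "unhelpful", "q3": "helpful"},
    "r2": {"q1": "helpful", "q2": "helpful", "q3": "helpful"},
}
print(gold_accuracy(raters["r1"], gold))  # 1.0
print(pairwise_agreement(raters))         # 2 of 3 shared items agree: ~0.67
```

In practice, teams layer chance-corrected measures such as Cohen's kappa on top, but even raw agreement against gold items is enough to flag evaluators who need recalibration.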

These same mechanisms, honed over years of relevance labelling, are now critical for companies like Cohere that are integrating human-in-the-loop signals directly into the LLM training process.

From Links to Language: The Evolution Continues

As generative systems move into open-ended domains, the need for high-fidelity human judgment has never been greater. And while the interfaces have changed—from SERPs to chatbots, from snippets to synthetic personas—the foundational data annotation challenge remains: how do we know what “good” looks like?

At Appen, we support both sides of the equation: teaching models how to behave through structured fine-tuning data, and judging how well they perform using clear, multi-dimensional rubrics.

Whether you’re building research prototypes or enterprise-ready LLMs, delivering high-quality human feedback at scale remains essential. Let’s talk about how we can bring the rigour of relevance to your generative AI evaluation stack.
