
How Cohere Scaled Preference-Based Fine-Tuning for Enterprise LLMs

July 11, 2025

Introduction

Aligning LLM performance with human values is a key differentiator in today’s competitive AI market. However, operationalising human feedback at scale while maintaining high-quality inputs and low latency poses several challenges. To address this growing demand, Cohere built PANDA Plus, a program for preference data generation and reward signal development, and partnered with Appen to source expert annotators, support real-time model feedback, and deliver human-centric LLM training data for both experimental and production fine-tuning. Appen enabled scalable, high-quality data generation and real-time annotation for PANDA Plus — supporting Cohere in improving their generative Large Language Model, Command.

About Cohere

Cohere is the leading security-first enterprise AI company. They build cutting-edge AI models and end-to-end solutions designed to solve real-world business problems. Their flagship generative LLM series, optimised for secure enterprise deployments, is called Command. Leading enterprises in regulated industries trust Cohere with customer-facing and internal support use cases, so it is essential that the model produces helpful, safe, and brand-aligned responses across diverse domains from retail to banking. Maintaining this high standard requires continual reinforcement learning and fine-tuning with reliable, domain-relevant human feedback.

To accelerate Command’s performance, Cohere developed Preference Annotation Data Acquisition Plus Supervised Fine-Tuning (SFT), also known as PANDA Plus. This program improves model performance by collecting structured human preference data and editing the preferred response to better satisfy Command’s principles and the user’s instructions. Cohere collaborated with Appen to scale this system across live models while maintaining quality and adaptability.

1. Project Goals

PANDA Plus integrates real-time model evaluation and editing into Cohere’s training loop. Each task presents annotators with two model completions for a given prompt and asks them to:

  • Choose the more helpful or aligned response
  • Optionally edit a completion to better reflect ideal model behaviour
  • Provide justification and qualitative feedback
  • Complete supervised fine-tuning completion rewrites
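
The task structure above can be sketched as a simple annotation record. This is an illustrative sketch only, assuming hypothetical names (`PreferenceTask`, `to_sft_example`, and the field names are ours, not Cohere's or Appen's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceTask:
    """One preference-annotation task: a prompt and two model completions."""
    prompt: str
    completion_a: str
    completion_b: str
    # Annotator-supplied fields
    preferred: Optional[str] = None          # "a" or "b"
    edited_completion: Optional[str] = None  # optional rewrite of the preferred response
    justification: Optional[str] = None      # qualitative feedback

    def to_sft_example(self) -> dict:
        """Emit a supervised fine-tuning pair: the prompt plus the best available target
        (the annotator's edit if present, otherwise the preferred completion)."""
        if self.preferred is None:
            raise ValueError("task has not been annotated yet")
        chosen = self.completion_a if self.preferred == "a" else self.completion_b
        return {"prompt": self.prompt, "completion": self.edited_completion or chosen}

task = PreferenceTask(
    prompt="Summarise our refund policy for a customer.",
    completion_a="Refunds take 30 days.",
    completion_b="Refunds are processed within 30 days of the return being received.",
)
task.preferred = "b"
task.justification = "B is more precise about when the clock starts."
print(task.to_sft_example())
```

The optional edit field is what distinguishes this workflow from plain A/B ranking: when an annotator rewrites the preferred response, the edit (not the raw completion) becomes the fine-tuning target.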

Cohere partnered with Appen to:

  • Ensure consistent, high-quality annotations from contributors with LLM experience
  • Reduce latency for model feedback using Appen’s real-time delivery system
  • Support dynamic task variants (e.g. chat continuation, open-ended instruction-following)
  • Enable both experimental and production-ready training cycles

2. Challenges

A. Finding Qualified Annotators

Cohere required annotators familiar with LLMs who could deliver high-quality data with minimal onboarding. Appen provided Cohere with a vetted pool of 200 US-English contributors, prioritising prior LLM/RLHF experience.

B. Prioritising Quality over Volume

Unlike traditional annotation pipelines, PANDA Plus prioritised per-task handling time and fidelity over raw throughput. This required tuning incentive structures and managing contributor pacing to optimise for thoughtful, context-aware edits.

C. Real-Time Feedback Loop

PANDA Plus required a live connection to Command’s API, enabling contributors to evaluate model outputs in near-real time. Appen adapted its AI Chat Feedback Tool to interface with PANDA Plus, including dynamic preambles, prompt routing, and response comparison.

D. Supporting Model Evolution

Cohere fine-tuned a production-grade model using Appen-generated preference data, while parallel PANDA Plus tasks fed into ongoing experimental variants. This required Appen to maintain annotation consistency across shifting model checkpoints, without compromising data structure or quality.

3. Solutions

Step One: Expert Contributor Pipeline

Appen assembled a domain-qualified contributor pool tailored for PANDA Plus. Contributors were trained to evaluate:

  • Usefulness, safety, and tone
  • Instruction adherence and domain relevance
  • Opportunities for refinement or escalation

Appen contributors performed:

  • A/B preference ranking
  • Multi-turn chat continuation scoring
  • Freeform feedback for tooling and prompt iteration
  • Complex prompt and preamble writing
  • Completion rewriting for “perfect” SFT inputs

Step Two: Tooling and Real-Time Delivery

The PANDA Plus workflow was delivered through a custom deployment of Appen’s AI Data Platform (ADAP), with enhancements including:

  • Direct integration with Command’s inference endpoint
  • Multi-turn prompt/response workflows
  • Structured fields for ranking, editing, and justification
  • Weekly batch summaries and daily live data streams

Appen contributors logged over 2,400 expert hours in 12 weeks, enabling Command’s training loop to incorporate human feedback in near-real time.
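
The loop described above can be sketched in miniature. This is a hedged illustration, not Appen's ADAP or Cohere's integration code: `generate` is a stub standing in for a call to a live inference endpoint, and the record layout mirrors the structured fields listed above (ranking, editing, justification) under our own assumed names:

```python
import json
from typing import Callable

def generate(prompt: str, seed: int) -> str:
    """Stub for a live inference endpoint. A real deployment would call the
    model's API here, e.g. sampling two completions with different seeds."""
    return f"[completion {seed}] response to: {prompt}"

def run_task(prompt: str, annotate: Callable[[str, str, str], dict]) -> dict:
    """Fetch two completions in near-real time, hand them to an annotator,
    and return a structured record for the training loop."""
    a, b = generate(prompt, seed=1), generate(prompt, seed=2)
    judgement = annotate(prompt, a, b)  # e.g. {"preferred": ..., "justification": ...}
    return {"prompt": prompt, "completion_a": a, "completion_b": b, **judgement}

# Example annotator callback that always prefers the first completion.
record = run_task(
    "Explain two-factor authentication in one sentence.",
    annotate=lambda p, a, b: {"preferred": "a", "justification": "more concise"},
)
print(json.dumps(record, indent=2))
```

The key property of this shape is that completions are fetched at annotation time rather than pre-batched, which is what lets feedback track the current model checkpoint.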

4. Results

High-Confidence Fine-Tuning Data

PANDA Plus data contributed directly to the Command model, with multiple fine-tuning runs leveraging human preference signals collected by Appen.

Support for Experimental Training

Beyond production, PANDA Plus also supported research-grade experimentation, offering long-term value for model iteration.

Contributor Retention and Quality

Appen maintained a consistent contributor pool over the project’s 12-week duration, ensuring stable annotation behaviour and predictable performance across variants.

System-Level Impact

By integrating real-time model interaction, edit-based supervision, and crowd feedback into PANDA Plus, Cohere advanced its alignment pipeline — with Appen playing a key role in turning subjective preference into structured AI training data.

Conclusion

Cohere’s collaboration with Appen on PANDA Plus is a model example of enterprise-scale preference training, including:

  • Skilled annotators with LLM context
  • Custom tooling for real-time feedback
  • Structured editing and justification
  • Integration with both research and production fine-tuning loops

As frontier model builders look to scale human feedback efficiently and responsibly, PANDA Plus demonstrates how data partnerships can drive both model performance and alignment quality — without sacrificing control, safety, or enterprise readiness.