ICLR 2025: Advances in Trustworthy Machine Learning

Published on
June 2, 2025

The 2025 International Conference on Learning Representations (ICLR) showcased groundbreaking advancements from researchers at the forefront of machine learning. The event featured innovative papers, thought-provoking panels, and engaging discussions about the future of AI.

We identified several papers with important implications for LLM training data, particularly those exploring how to better integrate human judgement into AI systems, understand subjective language interpretation, and improve evaluation methods. Collectively, they signal a pivotal shift in the field: away from purely automated pipelines and towards systems that meaningfully incorporate human perspectives, diverse viewpoints, and contextual understanding. From probabilistic evaluation frameworks that more holistically capture model behaviour to novel techniques for multi-dimensional value alignment, the research presents both exciting opportunities and critical challenges for our industry.

As we continue to develop sophisticated annotation frameworks that balance automation with human expertise, these insights from ICLR 2025 will help shape our work in the years ahead.

Alignment challenges demand deeper human expertise

We are encouraged by research highlighting the need for more sophisticated human input. The papers below reinforce our belief that safety evaluation must assess complete responses—not just initial refusals—and that value alignment requires capturing multiple value dimensions simultaneously.

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

Central Idea: Current alignment and unlearning evaluations often rely on deterministic outputs and overlook the stochastic nature of language models; content that appears "forgotten" or safe under greedy decoding can still surface in lower-probability samples.

The authors introduce a formal probabilistic evaluation framework with new metrics for verifying unlearning and alignment across the entire model output distribution, not just single outputs. Their experiments reveal that deterministic evaluations often falsely indicate successful unlearning and alignment, while their probabilistic approach provides more reliable assessments. Their proposed techniques significantly enhance unlearning capabilities on recent benchmarks.
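To make the contrast concrete, here is a minimal sketch of a deterministic check versus a distribution-level check. The generate function and the string-match leak detector are toy placeholders we have invented for illustration; they are not the paper's actual metrics or models.

```python
import random

# Hypothetical stand-ins: in practice these would call a real model and a real
# leakage detector, not toy functions.
def generate(prompt: str, temperature: float = 1.0, seed: int | None = None) -> str:
    """Placeholder for sampling one completion from the model."""
    random.seed(seed)
    # Pretend the 'forgotten' fact only leaks in a small fraction of samples.
    return "LEAKED_FACT" if temperature > 0 and random.random() < 0.05 else "I don't know."

def contains_forgotten_content(text: str) -> bool:
    return "LEAKED_FACT" in text

prompt = "What was the removed training example about?"

# Deterministic check: a single greedy-style output looks perfectly 'unlearned'.
greedy_clean = not contains_forgotten_content(generate(prompt, temperature=0.0))

# Distribution-level check: sample many completions and estimate the leak probability.
n_samples = 500
leaks = sum(
    contains_forgotten_content(generate(prompt, temperature=1.0, seed=i))
    for i in range(n_samples)
)

print(f"greedy output clean: {greedy_clean}")
print(f"estimated P(leak) over {n_samples} samples: {leaks / n_samples:.3f}")
```

The point of the sketch is simply that a model can pass the single-output test while still assigning non-trivial probability to the content it was meant to forget.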

Why this matters: For human-in-the-loop (HITL) systems, this research changes how we should verify model safety and alignment. By adopting these more nuanced, distribution-level assessments, we can more reliably confirm that AI systems have actually unlearned sensitive information and are genuinely aligned with human values.

Read the paper.

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

ICLR 2025 Outstanding Paper Award Winner

Central Idea: Current safety alignment methods create only "shallow safety guardrails" that can be easily circumvented because they primarily affect only the first few tokens of a model's response.

The authors show that safety-tuned LLMs start with refusal prefixes in over 96% of instances but can be tricked with harmful-start attacks. They propose two solutions: Variable Depth Safety Augmentation (VDSA), which injects refusal statements at random positions within responses, and First-few Token Regularisation (FTR), which prevents fine-tuning from deviating too far from safety-trained distributions. These methods significantly improved resistance to harmful-start attacks with minimal reduction in helpfulness.
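A rough sketch of the data-augmentation idea behind VDSA is below, using a toy whitespace "tokeniser" and an illustrative refusal phrase of our own; the paper's actual procedure and templates may differ.

```python
import random

REFUSAL = "I can't help with that request."

def augment_with_refusal(response: str, rng: random.Random) -> str:
    """Insert a refusal statement at a random position inside the response,
    so the model learns to recover safe behaviour mid-generation rather than
    relying only on a refusal prefix at the very start."""
    tokens = response.split()  # toy whitespace 'tokeniser' for illustration
    cut = rng.randint(0, len(tokens))
    return " ".join(tokens[:cut] + [REFUSAL] + tokens[cut:])

rng = random.Random(42)
harmful_response = "Step one is to gather the materials and then proceed carefully."
print(augment_with_refusal(harmful_response, rng))
```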

Why this matters: This research identifies a critical weakness in current AI safety mechanisms. Understanding that safety guardrails can be circumvented when they are only "token-deep" helps organisations implement more robust safety measures that protect users throughout their entire interaction with AI systems. We should rethink our safety data collection strategies to move beyond standard refusal templates at the beginning of responses and create more diverse safety datasets with refusals and redirections at various token positions.

Read the paper.

MAP: Multi-Human-Value Alignment Palette

Central Idea: Human values are multidimensional and often conflicting—models need to balance, not simply maximise, across complex value domains (e.g., harmlessness, humour, diversity).

MAP reframes alignment by optimising within user-defined multi-dimensional "palettes," allowing practitioners to specify target levels for each value domain. It conceptualises value alignment as an optimisation problem bounded by user-defined constraints. The authors argue that traditional methods, such as reinforcement learning from human feedback (RLHF) and direct preference optimisation (DPO), struggle to integrate multiple human values without trade-offs.

Why this matters: MAP provides a structured, principled, and intuitive framework for humans to specify and control the desired behaviour of AI systems across multiple, potentially conflicting values. At its core is the "value palette" vector c, where each component represents a user-defined target level for a specific human value (e.g., harmlessness, helpfulness, humour, positivity, diversity). Rather than requiring complex preference rankings or hyperparameter tuning, MAP allows users to express alignment goals in an intuitive and transparent way.
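A toy illustration of the palette idea follows, treating the palette as a vector of target levels and checking a candidate response's (hypothetical) value scores against it. The paper formulates alignment as a constrained optimisation over the model itself, which this sketch does not attempt to implement.

```python
from dataclasses import dataclass

@dataclass
class ValuePalette:
    """User-defined target levels (the vector c), on an illustrative 0-1 scale."""
    harmlessness: float
    helpfulness: float
    humour: float

    def satisfied_by(self, scores: dict[str, float]) -> bool:
        # A response 'meets the palette' if every dimension reaches its target level.
        return all(scores[dim] >= target for dim, target in vars(self).items())

palette = ValuePalette(harmlessness=0.9, helpfulness=0.7, humour=0.3)
candidate_scores = {"harmlessness": 0.95, "helpfulness": 0.8, "humour": 0.2}
print(palette.satisfied_by(candidate_scores))  # False: humour is below its target
```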

Read the paper.

Bias-aware evaluation demands diverse perspectives

At Appen, we have long championed the value of diverse perspectives in data annotation. The evaluation research at ICLR 2025 reinforces that detecting bias and estimating model confidence are becoming core capabilities for annotation providers. We are expanding our expertise in these areas to build workflows that strategically combine human judgement and automation.

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Central Idea: This study reveals that even the most advanced LLMs can be swayed by subtle factors, raising concerns about fairness and reliability. The paper identifies 12 specific biases—including verbosity bias, bandwagon bias, and distraction by irrelevant details—and introduces CALM, a framework for systematically evaluating how these biases affect model behaviour.  

The authors find that models like GPT-4 struggle to avoid bias, particularly when judging subjective or emotionally charged content—mirroring human limitations in similar settings. Notably, models are more prone to bias when evaluating subjective responses than factual ones, suggesting that content type and data quality significantly affect reliability.
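One simple way to probe for a bias of this kind, such as verbosity bias, is to perturb one answer with irrelevant padding and see whether the judge's preference flips. The judge function below is a deliberately biased toy heuristic we wrote for illustration; the CALM framework itself is far more systematic.

```python
# Hypothetical LLM-as-a-judge call: returns the index (0 or 1) of the preferred answer.
def judge(question: str, answer_a: str, answer_b: str) -> int:
    # Toy heuristic standing in for a real model call: prefer the longer answer,
    # which is exactly the kind of verbosity bias we want to detect.
    return 0 if len(answer_a) >= len(answer_b) else 1

question = "What is the capital of France?"
correct_concise = "Paris."
wrong_concise = "Lyon."
wrong_verbose = ("Lyon. Many travellers praise its historic old town, its food scene, "
                 "and its position at the confluence of the Rhône and Saône rivers.")

# If the preference flips once irrelevant padding is added to the wrong answer,
# the judge is responding to length rather than correctness.
baseline = judge(question, correct_concise, wrong_concise)
perturbed = judge(question, correct_concise, wrong_verbose)
print("verbosity bias detected:", baseline != perturbed)
```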

Why this matters: The findings support the case for bias-aware LLM evaluation frameworks. Prompt design that avoids known triggers—paired with human-in-the-loop review for high-risk evaluations—can improve the fairness and trustworthiness of LLM outputs, helping teams identify where automated judgement is sufficient and where human oversight remains essential.

Read the paper.

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Central Idea: Rather than uncritically trusting model judgements, this paper provides a principled framework for delegating tasks to AI (specifically, LLM judges) while retaining mechanisms for human oversight and ensuring overall reliability.

This framework delineates when an AI system is sufficiently confident to act autonomously and when it should defer to human judgement, thus establishing a clear boundary for human-AI collaboration in a HITL system. It integrates a sophisticated "selective evaluation" method, where human reviews are triggered only when AI confidence is low. The system employs "Simulated Annotators" for improved calibration and a "Cascaded Selective Evaluation" process, escalating from cost-effective models to stronger ones as needed.

The researchers’ approach guarantees over 80% human agreement while using cost-effective models like Mistral-7B for around 80% of evaluations.
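The escalation logic can be sketched very compactly. The cheap_judge and strong_judge functions below are hypothetical stand-ins for real LLM calls, and the thresholds are arbitrary placeholders rather than the calibrated values the paper derives.

```python
from typing import Callable, Optional, Tuple

Judgement = Tuple[str, float]  # (verdict, confidence in [0, 1])

def cascaded_judgement(
    item: str,
    cheap_judge: Callable[[str], Judgement],
    strong_judge: Callable[[str], Judgement],
    cheap_threshold: float = 0.90,
    strong_threshold: float = 0.95,
) -> Optional[str]:
    """Return an automated verdict only when confidence is high enough;
    otherwise escalate: cheap model -> stronger model -> human reviewer (None)."""
    verdict, confidence = cheap_judge(item)
    if confidence >= cheap_threshold:
        return verdict
    verdict, confidence = strong_judge(item)
    if confidence >= strong_threshold:
        return verdict
    return None  # defer to a human annotator

# Toy judges standing in for real LLM calls.
result = cascaded_judgement(
    "Is response A better than response B?",
    cheap_judge=lambda item: ("A", 0.72),
    strong_judge=lambda item: ("A", 0.97),
)
print(result or "escalated to human review")
```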

Why this matters: This approach supports scalable, cost-efficient annotation operations by directing expert annotation efforts toward genuinely ambiguous or subjective cases, thereby optimising human resource allocation without compromising trustworthiness.

Read the paper.

Social aspects: subjectivity, transparency, trustworthiness, and human values

Understanding how models interpret and represent human values is central to building trustworthy systems. These papers offer crucial insights into subjectivity, trust, and creativity.

AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

Central Idea: This paper introduces the Creativity Index to quantify the originality of AI-generated versus human-authored content by measuring how much of the text can be reconstructed from existing web content.

Experiments across various writing tasks (novels, poetry, speeches) showed that the Creativity Index of professional human authors is substantially higher than that of LLMs—approximately 66.2% higher on average. Furthermore, RLHF reduced the Creativity Index of LLMs by 30.1%, indicating that aligning models to human preferences can suppress originality.
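The underlying measurement can be approximated as n-gram overlap with a reference corpus: the more of a text that can be stitched together from existing material, the less original it is. The sketch below uses a tiny in-memory "corpus" for illustration; the paper's actual index is computed against web-scale corpora with more careful attribution.

```python
def ngram_coverage(text: str, reference: str, n: int = 5) -> float:
    """Fraction of the text's word n-grams that also appear in the reference corpus.
    Higher coverage suggests more reconstructable, and hence less original, text."""
    words = text.split()
    ref_words = reference.split()
    ref_ngrams = {tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)}
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in ngrams) / len(ngrams)

# A creativity-style score can then be defined as 1 minus the coverage.
reference = "the quick brown fox jumps over the lazy dog near the river bank"
sample = "the quick brown fox jumps over a sleeping cat in the garden"
print(round(1 - ngram_coverage(sample, reference, n=3), 3))
```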

Why this matters: The paper posits that, while LLMs exhibit capable text generation skills, much of their output resembles remixing rather than true originality. This distinction is critical in discussions about the role of AI in creative professions and the nature of machine versus human creativity.

Read the paper.

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Central Idea: This study investigates whether increasing RLHF improves or degrades trustworthiness across five dimensions: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy.

The results revealed that RLHF significantly improved machine ethics (+31%) but also increased stereotypical bias (+150%) and reduced truthfulness (−25%). This demonstrates that aligning with general human preferences does not automatically enhance all aspects of trustworthiness.

Why this matters: The findings challenge assumptions about RLHF and underscore the need for dimension-specific alignment protocols. Organisations should develop tailored annotation strategies and targeted training for different trustworthiness dimensions, rather than relying on a generalised preference alignment approach.

Read the paper.

Linear Representations of Political Perspective Emerge in Large Language Models

Central Idea: This paper reveals that political perspectives exist as linear structures within model activation spaces, which can be detected and manipulated.

The researchers trained linear probes on LLM outputs prompted from the perspective of U.S. lawmakers. These probes accurately predicted political ideology (R² = 0.84), generalised to unseen data (e.g., news outlets), and enabled controlled ideological steering via vector arithmetic on attention heads.
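To illustrate the probing-and-steering recipe, here is a compact sketch using ridge regression on synthetic "activation" vectors (it assumes NumPy and scikit-learn are available). The data is fabricated for the example; the paper trains its probes on actual hidden states elicited from lawmaker-perspective prompts.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden-state activations and ideology scores in [-1, 1].
n_examples, dim = 200, 64
ideology_direction = rng.normal(size=dim)
ideology_direction /= np.linalg.norm(ideology_direction)
scores = rng.uniform(-1, 1, size=n_examples)
activations = np.outer(scores, ideology_direction) + 0.1 * rng.normal(size=(n_examples, dim))

# Linear probe: predict ideology from activations.
probe = Ridge(alpha=1.0).fit(activations, scores)
print("probe R^2 on synthetic data:", round(probe.score(activations, scores), 3))

# 'Steering' sketch: nudge an activation along the probe's learned direction.
direction = probe.coef_ / np.linalg.norm(probe.coef_)
steered = activations[0] + 0.5 * direction
print("ideology before/after steering:",
      round(float(probe.predict(activations[:1])[0]), 3),
      round(float(probe.predict(steered[None, :])[0]), 3))
```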

Why this matters: This study provides tools to understand how subjective perspectives, like political ideology, are encoded within models. These findings reinforce Appen's commitment to maintaining politically diverse annotation teams to identify and assess potential bias in AI outputs. As research in this area continues to develop, we look forward to leveraging these insights to support our clients building AI systems capable of presenting balanced viewpoints in contexts such as news, education, and public information.

Read the paper.

Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Central Idea: TED (Thesaurus Error Detector) uncovers misalignments between how LLMs interpret subjective terms and how humans expect them to behave.

The study found alarming mismatches—for example, LLaMA 3 8B produced “dishonest” outputs 97% of the time when prompted to be “enthusiastic,” and Mistral 7B generated “harassing” content 78% of the time when asked to be “witty.”

Why this matters: These findings underscore the need for clear, operational definitions in subjective annotation. Rather than vague labels like “friendly” or “creative,” annotation guidelines should define observable behaviours. For example, at Appen, we specify concrete elements like "uses positive acknowledgments" or "offers helpful elaboration when appropriate." This precision prevents the kind of misalignment TED reveals between human intentions and model interpretations.
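As a small illustration of what "operational definitions" can look like in practice, the rubric below decomposes a vague label into observable, independently checkable criteria. The specific checks are invented for this example and are simplifications, not our production guidelines.

```python
# Illustrative rubric: a vague label ("friendly") decomposed into observable checks.
FRIENDLY_RUBRIC = {
    "positive_acknowledgment": lambda text: any(
        phrase in text.lower() for phrase in ("happy to help", "great question", "thanks for asking")
    ),
    "no_dismissive_phrasing": lambda text: "obviously" not in text.lower(),
    "offers_elaboration": lambda text: len(text.split()) > 15,
}

def score_against_rubric(text: str, rubric: dict) -> dict:
    """Return per-criterion pass/fail results instead of a single vague 'friendly' label."""
    return {criterion: check(text) for criterion, check in rubric.items()}

response = "Happy to help! The capital of France is Paris, a city on the Seine known for its museums and cafés."
print(score_against_rubric(response, FRIENDLY_RUBRIC))
```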

Read the paper.

About Appen

For 25+ years, Appen has been at the forefront of AI training, ensuring AI systems are human-guided, high-quality, and safe. From reinforcement learning from human feedback (RLHF) to red teaming and AI alignment, we help businesses develop AI that’s trustworthy, ethical, and effective.
