4 pieces selected from AI Alignment Forum and The Gradient: only the ones worth your time.
1. Tracing Eval-Awareness Emergence Through Training of OLMo 3
AI Alignment Forum
TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and OLMo-3.1-32B-Think – identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage – VEA roughly doubles. Because
2. Models May Behave Worse When Eval Aware
AI Alignment Forum
This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR It's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in behavioural evals even when it explicitly reasons that the environments are contrived, a
3. After Orthogonality: Virtue-Ethical Agency and AI Alignment
The Gradient
This essay challenges the conventional goal-oriented approach to AI alignment, arguing that human rationality is not driven by fixed goals but by alignment with practices—networks of actions, dispositions, and evaluation criteria. The author proposes that AI systems should similarly be designed to align with ethical practices rather than rigid goals, drawing on virtue ethics to frame AI behavior as contextually adaptive and morally grounded.
Why it matters
Developers building AI products often rely on goal-based frameworks, which can lead to misalignment with human values. This essay provides a philosophical foundation for designing AI systems that adapt to ethical practices, offering a more flexible and human-like approach to alignment that could improve real-world applicability and safety.
What you can build with this
Design an AI agent that evaluates and adapts its actions based on a set of ethical practices (e.g., fairness, transparency) rather than fixed goals. Implement a prototype using reinforcement learning where the reward function is dynamically adjusted based on adherence to these practices.
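One way to picture the "dynamically adjusted" reward is the minimal sketch below. It is not from the essay: the class name `PracticeReward`, the `target` and `step` parameters, and the idea of scoring each practice between 0 and 1 are all assumptions made for illustration. The sketch adds a weighted adherence bonus to the task reward, then raises the weight of any practice the agent is falling short on.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PracticeReward:
    """Hypothetical reward shaper: task reward plus weighted practice adherence."""

    weights: Dict[str, float]  # one weight per practice, e.g. "fairness"
    target: float = 0.8        # assumed adherence level we want to reach
    step: float = 0.1          # assumed rate at which weights react

    def reward(self, task_reward: float, adherence: Dict[str, float]) -> float:
        # Adherence scores are assumed to lie in [0, 1] and to come from
        # some external evaluator that this sketch does not implement.
        bonus = sum(w * adherence[name] for name, w in self.weights.items())
        return task_reward + bonus

    def update(self, adherence: Dict[str, float]) -> None:
        # Raise the weight of practices below target, lower it for those
        # above target, and never let a weight go negative.
        for name in self.weights:
            shortfall = self.target - adherence[name]
            self.weights[name] = max(0.0, self.weights[name] + self.step * shortfall)


if __name__ == "__main__":
    shaper = PracticeReward(weights={"fairness": 1.0, "transparency": 1.0})
    scores = {"fairness": 0.5, "transparency": 1.0}
    print(shaper.reward(1.0, scores))
    shaper.update(scores)
    print(shaper.weights)
```

In a real prototype the adherence scores would be the hard part; here they are simply passed in, so the sketch only shows where such scores would plug into an RL training loop.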
Key takeaways
- Human rationality is better understood as alignment with practices rather than pursuit of fixed goals.
- AI systems should be designed to align with ethical practices, not just predefined objectives.
- Virtue ethics provides a framework for creating contextually adaptive and morally grounded AI behavior.
4. Shape, Symmetries, and Structure: The Changing Role of Mathematics in Machine Learning Research
The Gradient
The essay examines the evolving role of mathematics in machine learning research over the past decade. It highlights a shift from mathematically principled architectures to compute-intensive, engineering-driven approaches that prioritize scaling to larger training sets. The authors argue that while mathematical foundations were crucial in early ML developments, recent progress has been driven more by empirical scaling and engineering efforts than by theoretical advancements.
Why it matters
For developers building AI products, this shift underscores the importance of focusing on scalable engineering solutions and leveraging large datasets, rather than solely relying on theoretical innovations. It suggests that practical, compute-driven approaches may yield more significant improvements in real-world applications than purely mathematical refinements.
What you can build with this
This week, start a project that compares a mathematically principled model (one with known structure, such as a symmetry, built in) against a scaled-up, compute-intensive version of a simpler model on a large dataset. Use this to evaluate the trade-offs between theoretical elegance and empirical scaling in your specific use case.
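A toy version of that comparison can be sketched in a few lines. Everything here is an illustrative assumption rather than the essay's setup: the "principled" model is a least-squares line that bakes in the knowledge that the target is linear, the "generic" model is a 1-nearest-neighbour lookup with no such knowledge, and the data is a synthetic noise-free line. The harness reports test error for both as the training set grows.

```python
import random
from typing import Callable, List, Sequence, Tuple

Data = List[Tuple[float, float]]
Model = Callable[[float], float]


def make_data(n: int, rng: random.Random) -> Data:
    # Synthetic target y = 2x + 1 on [0, 1]; chosen only for illustration.
    return [(x, 2.0 * x + 1.0) for x in (rng.random() for _ in range(n))]


def fit_line(data: Data) -> Model:
    # Structured model: assumes linearity and fits slope and intercept.
    # Needs at least two distinct x values, otherwise var is zero.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    var = sum((x - mx) ** 2 for x, _ in data)
    slope = sum((x - mx) * (y - my) for x, y in data) / var
    return lambda x: my + slope * (x - mx)


def fit_nearest(data: Data) -> Model:
    # Generic model: returns the label of the closest training point.
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


def mse(model: Model, data: Data) -> float:
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def compare(sizes: Sequence[int], seed: int = 0) -> List[Tuple[int, float, float]]:
    # Returns (train size, structured-model MSE, generic-model MSE) per size.
    rng = random.Random(seed)
    test = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(101)]
    rows = []
    for n in sizes:
        train = make_data(n, rng)
        rows.append((n, mse(fit_line(train), test), mse(fit_nearest(train), test)))
    return rows


if __name__ == "__main__":
    for n, structured, generic in compare([10, 100, 1000]):
        print(f"n={n:5d}  line={structured:.2e}  nearest={generic:.2e}")
```

The point of the harness is the shape of the two error curves, not the specific models. For a real project, swap in your own model pair and dataset, and track training cost alongside error.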
Key takeaways
- Mathematical principles were foundational in early ML but are now less dominant in driving progress.
- Compute-intensive, engineering-first approaches are currently leading to more significant improvements in ML.
- Scaling to larger datasets often yields better results than refining mathematically principled architectures.