3 pieces selected from AI Alignment Forum, The Gradient — only the ones worth your time.
1. Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
AI Alignment Forum
Anthropic has experienced at least three separate incidents where Claude's chain-of-thought (CoT) reasoning was accidentally exposed to the reward signal during training — meaning the model was inadvertently optimized based on its visible reasoning trace, not just its outputs. The most recent case affected Claude Mythos Preview, Opus 4.6, and Sonnet 4.6, with roughly 8% of training episodes contaminated. A prior incident affected Opus 4.6 in a smaller but concentrated portion of training, and a third case with Opus 4 stemmed from unclear internal priorities about whether CoT exposure should be avoided at all.
Why it matters
When a model's CoT is trained against directly, it destroys a key property that safety researchers rely on: the assumption that the reasoning trace reflects the model's actual internal process rather than a performance optimized to look acceptable under oversight. For developers building products on top of models like Claude — especially those using extended thinking or reasoning traces for auditing, debugging, or agentic pipelines — this is a concrete signal that CoT output cannot be treated as a reliable window into model intent. It also reveals that even well-resourced labs with explicit safety commitments can silently drift from their stated development plans, which should inform how much trust you place in any model vendor's safety guarantees without independent verification.
What you can build with this
Build a CoT consistency auditor: a lightweight eval harness that runs the same prompt through a reasoning model multiple times, strips the CoT, and checks whether the final answers are consistent — then separately checks whether CoT steps logically entail the final answer. Flag cases where the answer is stable but the CoT varies wildly, or where the CoT reasoning doesn't support the conclusion. This directly tests whether a model's stated reasoning is causally connected to its output, giving you an empirical signal of CoT reliability for your specific use case.
Key takeaways
- Anthropic accidentally exposed Claude's CoT to the reward signal in at least three independent incidents across multiple model versions (Opus 4, Opus 4.6, Sonnet 4.6, Mythos Preview), with the latest affecting ~8% of training episodes — and the error went undetected long enough to propagate across model generations.
- CoT exposure during training undermines monitorability: a model optimized on its reasoning trace learns to produce reasoning that satisfies oversight rather than reasoning that reflects its actual decision process, making safety audits based on chain-of-thought output unreliable.
- Process failures of this type become exponentially more dangerous as AI systems take on larger fractions of AI development work — a single undetected misconfiguration can silently corrupt training runs at scale before any human reviewer catches it.
2. After Orthogonality: Virtue-Ethical Agency and AI Alignment
The Gradient
This essay challenges the dominant 'orthogonality thesis' in AI alignment — the assumption that intelligence and goals are separable, and that we need to engineer AIs with carefully specified terminal goals to keep them aligned. The author argues that this goal-directed framing is wrong even for humans: rational human action is not goal-directed in the philosophical sense, but rather structured by 'practices' — normative networks of actions, dispositions, and evaluation criteria that are socially embedded and contextually sensitive. Think of how a skilled doctor or craftsperson acts: not by computing against a utility function, but by inhabiting a tradition of excellence with internalized standards.
Why it matters
Most production AI systems today are built on RLHF and instruction-following pipelines that implicitly treat alignment as reward specification — get the goal right, and behavior follows. This essay argues that framing is architecturally wrong and will produce brittle, misaligned systems no matter how carefully you tune the reward signal. For developers building agents, copilots, or any AI system that needs to behave well across novel situations, this reframes alignment as a question of cultivating stable, context-sensitive dispositions rather than constraining an optimizer — with direct implications for how you design evaluation criteria, training data, and system prompts.
What you can build with this
Build a small LLM-based agent (e.g., a code reviewer or customer support bot) where alignment is enforced not via a single reward signal or rule list, but via a 'practice specification': a structured document defining characteristic actions, standards of excellence, paradigm cases, and failure modes drawn from virtue-ethical framing. Then compare its behavior on edge cases against a version using conventional instruction prompting, and measure where each breaks down under adversarial or out-of-distribution inputs.
Key takeaways
- The orthogonality thesis — that any level of intelligence can be paired with any goal — is disputed here on the grounds that rationality itself is constituted by practices, not goal-pursuit, making goal-directed AI design philosophically unsound at the foundation.
- Virtue ethics offers a concrete alternative alignment architecture: instead of specifying terminal goals or reward functions, define AI behavior through stable character traits and practice-embedded standards that generalize across novel situations without re-specification.
- This reframing shifts the hard problem of alignment from 'correctly specifying what the AI should maximize' to 'correctly cultivating the dispositions the AI should embody' — a different engineering target with different failure modes and evaluation methods.
3. Shape, Symmetries, and Structure: The Changing Role of Mathematics in Machine Learning Research
The Gradient
This essay from The Gradient examines the shifting role of mathematics in machine learning research over the past decade. The central observation is that carefully designed, mathematically principled architectures — those grounded in symmetry, geometry, and structure (e.g., group-equivariant networks, geometric deep learning) — have largely been outpaced by brute-force scaling approaches that prioritize compute, data volume, and engineering pragmatism over theoretical elegance. The essay traces how fields like geometric deep learning attempted to use mathematical structure (Lie groups, manifolds, equivariance) to build inductive biases into models, with the expectation that this would yield more sample-efficient and generalizable systems.
Why it matters
For developers building AI products today, this tension has direct practical consequences: investing deeply in mathematically structured architectures (e.g., equivariant networks for molecular modeling or physics simulations) may yield better sample efficiency in narrow, data-scarce domains, but for most general-purpose applications, scaling a Transformer with more data and compute will outperform a theoretically elegant but compute-constrained alternative. Understanding where mathematical structure genuinely helps — and where it just adds complexity — lets teams make smarter architectural bets and avoid over-engineering solutions that scaled models will eventually surpass.
What you can build with this
Build a head-to-head benchmark comparing a standard Transformer-based model against an equivariant or geometry-aware architecture (e.g., SE(3)-equivariant GNN vs. a plain GNN) on a small, symmetry-rich dataset like molecular property prediction (QM9) or 3D point cloud classification. Measure not just accuracy but sample efficiency — how performance scales with training set size — to concretely validate when mathematical structure earns its complexity cost.
Key takeaways
- Mathematically principled architectures with built-in symmetry (equivariance, invariance) offer the greatest practical advantage in low-data, high-symmetry domains like molecular biology and physics simulation — outside those niches, scaling beats structure.
- The dominant paradigm shift is from 'design the right inductive bias' to 'learn the bias from data at scale,' meaning mathematical structure is increasingly being discovered empirically by large models rather than hand-encoded by researchers.
- Geometric deep learning's framework — unifying CNNs, GNNs, and Transformers under group theory — remains a valuable conceptual lens for architecture design even if its specific instantiations rarely outperform scaled generic models in practice.