6 pieces selected from AI Alignment Forum, The Gradient — only the ones worth your time.
1. Advice for making robust-to-training model organisms
AI Alignment Forum
We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training—training that doesn't d
2. Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
AI Alignment Forum
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.
Eval cooperativeness: A situat
3. Testing Gemini models for scheming tendencies
AI Alignment Forum
As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity. Our new papers focus on propensity for scheming: when models are deployed as coding agen
4. Full automation of AI R&D probably yields a large speed up even without a software-only singularity
AI Alignment Forum
This is a somewhat technical note. By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement i
5. After Orthogonality: Virtue-Ethical Agency and AI Alignment
The Gradient
Preface: This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices[1]: networks of actions, action-dispositions, action-evaluation criteria,
6. Shape, Symmetries, and Structure: The Changing Role of Mathematics in Machine Learning Research
The Gradient
What is the Role of Mathematics in Modern Machine Learning? The past decade has witnessed a shift in how progress is made in machine learning. Research involving carefully designed and mathematically principled architectures results in only marginal improvements, while compute-intensive and engineering-first efforts that scale to ever larger training sets