6 pieces selected from AI Alignment Forum and The Gradient: only the ones worth your time.
1. Looking for backdoors in Jane Street LLMs
AI Alignment Forum
I am going to talk about my experience in the Jane Street LLM backdoor challenge. I am sharing partial results. I managed to crack some of the models using white-box methods, after the activation/prompting approach didn't pan out. Happy to discuss better or more promising approaches.
Introduction
A few months ago a Dwarkesh Patel podcast episode advertised a Jane Street backdoor challenge:
We've
2. Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
AI Alignment Forum
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.
Eval cooperativeness: A situat
3. The Case for Evaluating Model Behaviors
AI Alignment Forum
Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on. From a safety perspective, capability evaluations have a place: by understanding how close we are to different capabilities, and the rate of progress on them, we can forecast when different risks are likely to occur, as well as the
4. Full automation of AI R&D probably yields a large speed up even without a software-only singularity
AI Alignment Forum
This is a somewhat technical note. By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement i
5. After Orthogonality: Virtue-Ethical Agency and AI Alignment
The Gradient
Preface
This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices[1]: networks of actions, action-dispositions, action-evaluation criteria,
6. Shape, Symmetries, and Structure: The Changing Role of Mathematics in Machine Learning Research
The Gradient
What is the Role of Mathematics in Modern Machine Learning?
The past decade has witnessed a shift in how progress is made in machine learning. Research involving carefully designed and mathematically principled architectures results in only marginal improvements while compute-intensive and engineering-first efforts that scale to ever larger training sets