I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
I mostly work on evals, but I am also interested in interpretability. My goal is to improve our understanding of scheming and build tools and methods to detect it.
I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.
For more see https://www.mariushobbhahn.com/aboutme/
I subscribe to Crocker’s Rules.
Thanks. I agree that this is a weak part of the post.
After writing it, I think I also updated a bit against very clean, unbounded power-seeking. But I now put more weight on “chaotic catastrophes”, e.g. something like:
1. Things move really fast.
2. We don’t really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways.
4. A “What failure looks like”-type loss of control.