I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.
I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.
For more, see https://www.mariushobbhahn.com/aboutme/
I subscribe to Crocker’s Rules.
In one of my MATS projects, we found that some models have a bias toward thinking they’re always being evaluated, even in real-world scenarios. The paper isn’t public yet. However, the belief seems pretty brittle, and the models don’t appear to hold it very strongly. I think this can be part of a strategy, but it should never be load-bearing.