Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:
> Looking at behaviour is conceptually straightforward, and valuable, and being done
I agree with Apollo Research that evals aren’t really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models under different schemes and comparing the results (a toy version of that comparison is sketched below).
Similarly, a lot of work with Model Organisms of Misalignment requires careful thought to get right.
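To make the comparison idea concrete, here is a minimal sketch, assuming two hypothetical auditing schemes that each return a pass/fail verdict per model (the schemes and the model pool are stand-ins, not any real API): run both over the same models and measure chance-corrected agreement. Low agreement is evidence that at least one scheme is tracking vibes rather than the underlying property.

```python
# Minimal sketch: compare two hypothetical auditing schemes on the same models.
# `scheme_a` / `scheme_b` and the model pool are illustrative stand-ins.
from typing import Callable, List

def cohens_kappa(a: List[bool], b: List[bool]) -> float:
    """Chance-corrected agreement between two lists of binary verdicts."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def audit_agreement(models: list,
                    scheme_a: Callable[[object], bool],
                    scheme_b: Callable[[object], bool]) -> float:
    """Run both schemes over the same model pool and report their kappa."""
    return cohens_kappa([scheme_a(m) for m in models],
                        [scheme_b(m) for m in models])

# Dummy usage: two "schemes" that flag models by unrelated heuristics.
models = list(range(10))
print(audit_agreement(models, lambda m: m % 2 == 0, lambda m: m < 5))  # 0.2
```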
> I agree with Apollo Research that evals aren’t really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models under different schemes and comparing the results.
>
> Similarly, a lot of work with Model Organisms of Misalignment requires careful thought to get right.
When I wrote that, I wasn’t thinking so much about evals / model organisms as stuff like:
- putting a bunch of agents in a simulated world and seeing how they interact (a toy version is sketched after this list)
- weak-to-strong / easy-to-hard generalization (also sketched below)
- basically stuff along the lines of “when you put agents in X situation, they tend to do Y thing”, rather than trying to understand latent causes / capabilities
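To illustrate the first bullet, here is a toy sketch of purely behavioural observation, with made-up agents and payoffs (nothing here corresponds to a real benchmark): drop simple harvesting policies into a shared resource pool and tally how often the resource collapses, without modelling why.

```python
# Toy behavioural observation: put simple agents in a shared-resource
# situation (X) and measure how often collapse (Y) emerges. All agents and
# payoffs are made up for illustration.
import random
from collections import Counter

def greedy_agent(stock: float) -> float:
    """Takes a large, noisy share of whatever is left."""
    return stock * random.uniform(0.4, 0.6)

def cautious_agent(stock: float) -> float:
    """Takes a small, capped share."""
    return min(1.0, stock * random.uniform(0.05, 0.15))

def run_episode(agents, stock=10.0, steps=20) -> str:
    """One rollout: agents harvest in turn, then the stock regrows 5%."""
    for _ in range(steps):
        for agent in agents:
            stock = max(0.0, stock - agent(stock))
        stock *= 1.05
    return "collapsed" if stock < 1.0 else "sustained"

random.seed(0)
population = [greedy_agent, greedy_agent, cautious_agent]
tally = Counter(run_episode(population) for _ in range(100))
print(tally)  # raw "in situation X, outcome Y" frequencies, no latent-cause claims
```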
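And for the second bullet, a minimal sketch of the usual weak-to-strong setup, using toy scikit-learn models as stand-ins for a weak supervisor and a strong student: fit the weak model on ground truth, train the strong model only on the weak model’s labels, and check how much of the strong ceiling it recovers.

```python
# Toy weak-to-strong generalization: does a strong student trained only on a
# weak supervisor's labels outperform that supervisor? Models and data are
# illustrative stand-ins, not the original paper's setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=15,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak supervisor: sees only 2 features, so its labels are noisy.
weak = LogisticRegression().fit(X_train[:, :2], y_train)
weak_labels = weak.predict(X_train[:, :2])

# Strong student: sees all features, but trains on weak labels, not ground truth.
student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

# Ceiling: the same strong model trained directly on ground truth.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

for name, model, feats in [("weak supervisor", weak, X_test[:, :2]),
                           ("weak-to-strong student", student, X_test),
                           ("strong ceiling", ceiling, X_test)]:
    print(f"{name}: {model.score(feats, y_test):.3f}")
```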