This was very interesting, looking forward to the follow up!
In the “AIs messing with your evaluations” (and checking for whether the AI is capable of/likely to do so) bit, I’m curious if there is any published research on this.
The closest existing thing I’m aware of is password locked models.
Redwood Research (where I work) might end up working on this topic or similar sometime in the next year.
It also wouldn’t surprise me if Anthropic puts out a paper on exploration hacking somewhat soon.
This was very interesting, looking forward to the follow up!
In the “AIs messing with your evaluations” (and checking for whether the AI is capable of/likely to do so) bit, I’m curious if there is any published research on this.
The closest existing thing I’m aware of is password locked models.
Redwood Research (where I work) might end up working on this topic or similar sometime in the next year.
It also wouldn’t surprise me if Anthropic puts out a paper on exploration hacking somewhat soon.