The closest existing thing I’m aware of is password-locked models.
Redwood Research (where I work) might end up working on this topic or something similar sometime in the next year.
It also wouldn’t surprise me if Anthropic puts out a paper on exploration hacking somewhat soon.