ryan_greenblatt comments on What’s up with “Responsible Scaling Policies”?

ryan_greenblatt 29 Oct 2023 17:21 UTC
8 points
The closest existing thing I’m aware of is password-locked models.

Redwood Research (where I work) might end up working on this topic, or something similar, sometime in the next year.

It also wouldn’t surprise me if Anthropic puts out a paper on exploration hacking somewhat soon.