Ariel G. comments on What’s up with “Responsible Scaling Policies”?

Ariel G. 29 Oct 2023 16:37 UTC
1 point
0
This was very interesting, looking forward to the follow up!

In the “AIs messing with your evaluations” (and checking for whether the AI is capable of/likely to do so) bit, I’m curious if there is any published research on this.
- ryan_greenblatt 29 Oct 2023 17:21 UTC
  8 points
  2
  Parent
  The closest existing thing I’m aware of is password locked models.
  
  Redwood Research (where I work) might end up working on this topic or similar sometime in the next year.
  
  It also wouldn’t surprise me if Anthropic puts out a paper on exploration hacking somewhat soon.