Prior to AI escape or takeover, will we catch an AI in production clearly and seriously trying to do an egregiously bad action when not at all prompted or directly trained to do so? For example, trying:
to escape
to create bioweapons
to do some action which is clearly part of an attempted takeover (the prior two actions might not be part of a takeover)
(This might not be because of internal motivations well described as deceptive alignment.)
(To be clear, I am not confidently predicting we will see this outcome and the original quote is from habryka.)
I expect not to see this, conditional on adding a stipulation like “the AI wasn’t scaffolded and then given a goal like ‘maximize profit’”; without that stipulation, I could imagine the AI system coming up with nasty subgoals. In particular, I don’t expect egregiously bad actions from autoregressive sampling of an LLM tasked with doing scientific research.
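As a rough illustration of the distinction I have in mind (a minimal sketch, not any particular lab's setup; `query_model` and `execute_action` are hypothetical stand-ins, not real APIs), here is what "scaffolded and given an open-ended goal" looks like next to plain autoregressive sampling:

```python
from typing import Callable, List

# Hypothetical stand-in for a single LLM completion call.
QueryFn = Callable[[str], str]


def plain_sampling(query_model: QueryFn, research_task: str) -> str:
    """Single autoregressive completion: the model only emits text for the
    given task; its output never becomes an action in the world."""
    return query_model(f"Write up an analysis of: {research_task}")


def scaffolded_agent(query_model: QueryFn,
                     goal: str,
                     execute_action: Callable[[str], str],
                     max_steps: int = 10) -> List[str]:
    """Agent loop: the model is handed an open-ended goal (e.g. 'maximize
    profit'), proposes actions, and sees the results of executing them.
    This is the setup where instrumental subgoals the operator never asked
    for could show up among the proposed actions."""
    history: List[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            "Previous steps and results:\n" + "\n".join(history) +
            "\nPropose the single next action."
        )
        action = query_model(prompt)
        result = execute_action(action)  # side effects happen here
        history.append(f"ACTION: {action}\nRESULT: {result}")
    return history
```

The point of the sketch is that the agent loop's `execute_action` step is where unintended subgoals could turn into real actions; plain sampling has no such step, which is why the stipulation above matters to my prediction.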
The most clear case I can think of is: