Prior to AI escape or takeover, will we catch an AI in production clearly and seriously trying to do an egregiously bad action when not at all prompted or directly trained to do so? For example, trying:
to escape
to create bioweapons
to do some action which is clearly part of an attempted takeover (the prior two actions might not be part of a takeover)
(This might not be because of internal motivations well described as deceptive alignment.)
(To be clear, I am not confidently predicting we will see this outcome and the original quote is from habryka.)
I expect not to see this, conditional on adding a stipulation like “the AI wasn’t scaffolded and then given a goal like ‘maximize profit’”; without that stipulation, I could imagine the AI system coming up with nasty subgoals. In particular, I don’t expect egregiously bad actions from autoregressive sampling of an LLM tasked with doing scientific research.
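As a rough illustration of the distinction I have in mind (a minimal sketch, not any particular lab's setup; `query_model` and `execute_action` are hypothetical stand-ins, not real APIs), here is what "scaffolded and given an open-ended goal" looks like next to plain autoregressive sampling:

```python
from typing import Callable, List

# Hypothetical stand-in for a single LLM completion call.
QueryFn = Callable[[str], str]


def plain_sampling(query_model: QueryFn, research_task: str) -> str:
    """Single autoregressive completion: the model only emits text for the
    given task; its output never becomes an action in the world."""
    return query_model(f"Write up an analysis of: {research_task}")


def scaffolded_agent(query_model: QueryFn,
                     goal: str,
                     execute_action: Callable[[str], str],
                     max_steps: int = 10) -> List[str]:
    """Agent loop: the model is handed an open-ended goal (e.g. 'maximize
    profit'), proposes actions, and sees the results of executing them.
    This is the setup where instrumental subgoals the operator never asked
    for could show up among the proposed actions."""
    history: List[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            "Previous steps and results:\n" + "\n".join(history) +
            "\nPropose the single next action."
        )
        action = query_model(prompt)
        result = execute_action(action)  # side effects happen here
        history.append(f"ACTION: {action}\nRESULT: {result}")
    return history
```

The point of the sketch is that the agent loop's `execute_action` step is where unintended subgoals could turn into real actions; plain sampling has no such step, which is why the stipulation above matters to my prediction.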
The most clear case I can think of is: