ok, I am quite confident you will get tons of evidence that AI systems are not aligned with you within the next few years. My primary question is what you will actually do as soon as you have identified a system as unaligned or dangerous in this way
Any operationalizations that people might make predictions on?
The clearest case I can think of is:
Prior to AI escape or takeover, will we catch an AI in production clearly and seriously trying to take an egregiously bad action when not at all prompted or directly trained to do so? For example, trying:
to escape
to create bioweapons
to do some action which is clearly part of an attempted takeover (the prior two actions might not be part of a takeover)
(This might not be because of internal motivations well described as deceptive alignment.)
(To be clear, I am not confidently predicting we will see this outcome, and the original quote is from habryka.)
I expect not to see this, conditional on adding a stipulation like “the AI wasn’t scaffolded and then given a goal like ‘maximize profit’”, because without that stipulation I could imagine the AI system coming up with nasty subgoals. In particular, I don’t expect egregiously bad actions from autoregressive sampling of an LLM tasked with doing scientific research.
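To make the scaffolding distinction concrete, here is a minimal sketch contrasting plain autoregressive sampling with a scaffolded agent loop given an open-ended goal. The function names (`complete`, `plain_sampling`, `scaffolded_agent`) and the canned model output are hypothetical placeholders, not from any real API or from the original discussion; the point is only where unintended subgoals could enter.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for one autoregressive LLM completion.

    Returns a canned string so the sketch runs without a real model call.
    """
    return "proposed next step (placeholder output)"


def plain_sampling(task: str) -> str:
    """Case 1: plain autoregressive sampling on a research task.

    The model produces text once; nothing executes the output.
    """
    return complete(f"Write an analysis for this research task:\n{task}")


def scaffolded_agent(goal: str, max_steps: int = 5) -> list[str]:
    """Case 2: a scaffolded agent loop given an open-ended goal.

    The model's own proposals are fed back in repeatedly; in a real system
    each proposal would also be executed (tools, code, web access), which is
    exactly the setup the stipulation above carves out.
    """
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Previous actions: {history}\n"
            "Propose the next action to take."
        )
        action = complete(prompt)
        history.append(action)
        # A real scaffold would execute `action` here; that execution step is
        # where nasty instrumental subgoals could turn into real-world actions.
    return history


if __name__ == "__main__":
    print(plain_sampling("characterize this protein's binding affinity"))
    print(scaffolded_agent("maximize profit"))
```

The sketch is only meant to show why the stipulation matters: the bad-action prediction concerns the first pattern, while the second pattern adds a goal plus an action loop, which is a different threat model.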