Let me see if I can invert your essay into the things you need to do to utilize AI safely, contingent on your theory being correct.
I think this framing could be helpful, and I’m glad you raised it.
That said, I want to be a bit cautious here. I think that CP is necessary for stories like deceptive alignment and reward maximization, so if CP is false, then those threat-models don’t go through. There are other risks from AI that don’t rely on these threat-models, so I don’t take myself to have offered a list of sufficient conditions for ‘utilizing AI safely’. Likewise, I don’t think CP being true implies that we’re doomed (i.e., (DecepAlign⇒CP)⇏(CP⇒DecepAlign)).
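To spell out the structure I have in mind (this is just the claim above restated in propositional form, with DecepAlign standing for the deceptive-alignment threat-model):

$$
\begin{aligned}
\text{(CP is necessary)}\quad & \text{DecepAlign} \Rightarrow \text{CP}\\
\text{(so, by contraposition)}\quad & \neg\text{CP} \Rightarrow \neg\text{DecepAlign}\\
\text{(but not conversely)}\quad & (\text{DecepAlign} \Rightarrow \text{CP}) \nRightarrow (\text{CP} \Rightarrow \text{DecepAlign})
\end{aligned}
$$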
Still, I think it’s fair to say that some of your “bad” suggestions are in fact bad, and that (e.g.) sufficiently long training-episodes are x-risk-factors.
Onto the other points.
If you allow complex off-task information to leak into the input from prior runs, you create the possibility of the model optimizing for both self-generated goals (hidden in the prior output) and the current context. The self-generated goals are consequentialist preferences.
I agree that this is possible. Though I’m unsure whether (and, if so, why) you think it’s likely, or even plausible, that AIs will form consequentialist preferences. Could you help me out here?
You then raise an alternative threat-model.
Hostile actors can and will develop and release models without restrictions, with global context and online learning, that have spent centuries training in complex RL environments with hacking training. They will have consequentialist preferences and no episode time limit, with broad-scope maximizing goals like “win the planet for the bad actors”.
I agree that this is a risk worth worrying about. But, two points.
I think the threat-model you sketch calls for a different set of interventions than threat-models like deceptive alignment and reward maximization; this post is focused solely on those two threat-models.
On my current view, I’d be happier if marginal ‘AI safety funding resources’ were devoted to misuse/structural risks (of the kind you describe) rather than to misalignment risks.
If we don’t get “broad-scope maximizing goals” by default, then I think this is, at the very least, promising evidence about the nature of the offense/defense balance.