I find myself unsure which conclusion this is trying to argue for.
Here are some pretty different conclusions:
1. Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it’s a problem after human obsolescence for our trusted AI successors, but who cares).
2. There aren’t any solid arguments for deceptive alignment[1]. So, we certainly shouldn’t be confident in deceptive alignment (e.g. >90%), though we can’t totally rule it out (prior to human obsolescence). Perhaps deceptive alignment is 15% likely to be a serious problem overall, and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
3. Deceptive alignment is <<1% likely for scaled-up LLM agents (prior to human obsolescence). Who knows about other architectures.
There is a big difference between <<1% likely and 10% likely. I basically agree with “not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment”, but I don’t think this leaves me in a <<1% likely epistemic state.
[1] Other than noting that it could be behaviorally consistent for powerful models; that is, powerful models are capable of deceptive alignment.
Closest to the third, but I’d put it somewhere between 0.1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.