I want to flag that the overall tone of the post is in tension with the disclaimer that you are “not putting forward a positive argument for alignment being easy”.
To hint at what I mean, consider this claim:
“Undo the update from the ‘counting argument’, however, and the probability of scheming plummets substantially.”
I think this claim is only valid if you are in a situation such as “your probability of scheming was >95%, and this was based basically only on this particular version of the ‘counting argument’”. That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc.), and this was it, then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn’t mean our intuitions cannot be more or less detailed. It’s just that even the detailed ones are not anywhere close to being rigorous.) E.g., for me personally, the vague intuition that “scheming is instrumental for a large class of goals” makes a huge contribution to my beliefs (of “something between 10% and 99% on alignment being hard”), while the particular version of the ‘counting argument’ that you describe makes basically no contribution. (Vague intuitions about simplicity priors also contribute non-trivially.) So undoing that particular update does ~nothing.
I do acknowledge that this view suggests that the AI-risk debate should basically be about the question: “So, we don’t have any rigorous arguments about AI risk being real or not, and we won’t have them for quite a while yet. Should we be super-careful about it, just in case?” But I do think that is appropriate.