My point here is that at the capability level of GPT-4, this distinction isn’t very important. There’s no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT-4 isn’t cleverly scheming. It is merely human-level at deception, and doesn’t pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-a-human convincingness. For most queries (it seems plausible to me) it isn’t even coherent enough to support a crisp distinction between honestly trying to answer the question and deceptively trying to make an answer look good; at its level of capability, it’s mostly the same thing one way or the other. The exceptions to this “mostly” aren’t strategic enough that we should expect them to route around obstacles cleverly.
It isn’t much, but it is more than I naively expected.