You're making a mistake: “If you ask it for a plan to manufacture paperclips, it doesn’t think the best plan would involve converting all the matter in the solar system into paperclips. If you ask for a plan to eliminate cancer, it doesn’t think the extermination of all biological life would count as a success.”
When you ask GPT for such things, you’re not setting its internal goal. You’re just interacting with it. Its internal goal might look like “Provide answers that are likely to satisfy the human, and write only ‘potato’ if the human provides the exact string ‘s8r5d6s8rd4c8e’.”
You have no idea what its internal goal actually is; you’re just seeing that, so far, it provides satisfying answers.
Such misaligned internal goals can still bite us, especially if there are specific outliers not present in the training dataset.
My point here is that at the capability level of GPT-4, this distinction isn’t very important. There’s no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT-4 isn’t cleverly scheming. It is merely human-level at deception, and doesn’t pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-a-human convincingness. For most queries (it seems plausible to me) it isn’t even coherent enough to support a crisp distinction between honestly trying to answer the question vs. deceptively trying to make an answer look good; at its level of capability, it’s mostly the same thing one way or the other. The exceptions to this “mostly” aren’t strategic enough that we expect them to route around obstacles cleverly.
It isn’t much, but it is more than I naively expected.