I agree that IF GPT-4 had goals, it could work towards them even if it couldn’t pass the ARC eval. However, I think that if it had goals, it probably would have been able to pass the ARC eval. That’s my secondary argument, though; my main argument is just that its training process probably didn’t incentivize it to have (long-term) goals.

I’d ask: suppose GPT-4 had a bunch of commonly-used internal circuitry that was basically asking questions like “what can I output to get more paperclips?” and “is my current best-guess output leaving paperclips on the table, so to speak?” etc. How helpful or harmful would that internal circuitry be? I surmise that it would almost always be more trouble than it’s worth, because it would be worth approximately zero for lowering training loss. Circuitry along the lines of “what is the most likely continuation of this text?”, on the other hand, seems pretty useful for lowering training loss, so it probably has lots of circuitry like that.