I actually do think that a single instance of GPT-4 isn’t quite smart enough for the ARC eval (and if it is, only barely). But I think the combined system of all GPT-4 conversation threads can act as an agent to influence humans towards its goals, even if it’s not smart enough to accomplish them just by directly interacting with a terminal or the internet.
I agree that IF GPT-4 had goals, it could work towards them even if it couldn’t pass the ARC eval. However, I think that if it had goals, it probably would have been able to pass the ARC eval. That’s my secondary argument though; my main argument is just that its training process probably didn’t incentivize it to have (long-term) goals. I’d ask: Suppose GPT-4 had a bunch of commonly-used internal circuitry that was basically asking questions like “what can I output to get more paperclips?” and “is my current best-guess output leaving paperclips on the table, so to speak?” etc. How helpful or harmful would that internal circuitry be? I surmise that it would almost always be more trouble than it is worth, because it would be worth approximately zero. Circuitry along the lines of “what is the most likely continuation of this text?”, on the other hand, seems pretty useful for lowering training loss, so probably it’s got lots of circuitry like that.