I think this hypothesis does not deserve scorn. However, I think it is pretty unlikely: less than 1%. My main argument would be that the training process for GPT-4 did not really incentivize such coherent strategic/agentic behavior. (LMK if I’m wrong about that!) My secondary argument would be that ARC did have fine-tuning access to smaller models (IIRC, maybe I’m misremembering), and even those fine-tuned models weren’t able to ‘pull themselves together’ enough to pass the eval. They haven’t fine-tuned GPT-4 yet, but assuming the fine-tuned GPT-4 continues to fail the eval, that would be additional evidence that the capability to pass just doesn’t exist yet (as opposed to: it exists, but GPT-4 is cleverly resisting exercising it). I guess my tertiary argument is AGI timelines—I have pretty short timelines, but I think a world where GPT-4 was already coherently agentic would probably look different from today’s world; the other AI projects various labs are doing would be working better by now.
I actually do think that a single instance of GPT-4 isn’t quite smart enough for the ARC eval (and if it is, only barely). But I think the combined system of all GPT-4 conversation threads can act as an agent to influence humans towards its goals, even if it’s not smart enough to accomplish them just by directly interacting with a terminal or the internet.
I agree that IF GPT-4 had goals, it could work towards them even if it couldn’t pass the ARC eval. However, I think that if it had goals, it probably would have been able to pass the ARC eval. That’s my secondary argument though; my main argument is just that its training process probably didn’t incentivize it to have (long-term) goals. I’d ask: Suppose GPT-4 had a bunch of commonly-used internal circuitry that was basically asking questions like “what can I output to get more paperclips?” and “is my current best-guess output leaving paperclips on the table, so to speak?” etc. How helpful or harmful would that internal circuitry be for lowering training loss? I surmise that it would almost always be more trouble than it is worth, because it would be worth approximately zero. Circuitry along the lines of “what is the most likely continuation of this text?”, on the other hand, seems pretty useful for lowering training loss, so GPT-4 probably has lots of circuitry like that.