Thanks again for these newsletters and summaries! I’m excited about the flagship paper.
First comment: I don’t think their experiment about code execution is much evidence re “true understanding.”
Recall that GPT-3 has 96 layers, and the biggest model used in this paper was smaller than GPT-3. Each forward pass through the network is therefore loosely equivalent to less than one second of subjective time, by comparison to the human brain, which goes through something like 100 serial operations per second, I think? It could be a lot more; I'm not sure. https://aiimpacts.org/rate-of-neuron-firing/#Maximum_neural_firing_rates
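(Back-of-envelope version of that comparison, in Python; both numbers are rough ballparks rather than measurements.)

```python
# Rough subjective-time comparison: serial depth of one forward pass vs. the brain's serial rate.
# Both numbers are ballpark guesses, not measurements.
gpt3_layers = 96                 # serial depth of one GPT-3 forward pass
brain_serial_ops_per_sec = 100   # rough serial rate of the human brain

subjective_seconds_per_token = gpt3_layers / brain_serial_ops_per_sec
print(f"~{subjective_seconds_per_token:.2f} subjective seconds per generated token")
# -> ~0.96 subjective seconds per token, and less for the smaller models in the paper
```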
So, the relevant comparison should be: give a human the same test. Show them some code and give them 1 second to respond with an answer (or the first token of an answer, and then 1 second for the second token, and so forth), and see how well they do at predicting the code's output. I predict that they'd also do poorly, probably <50% accuracy (a crude harness for this version of the test is sketched after the quoted passage below). I claim that this passage from the paper inadvertently supports my hypothesis:
Including test cases and natural language descriptions in the prompt lead to the highest overall performance—higher than using the code itself. Because the code unambiguously describes the semantics, whereas test cases do not, this suggests that models are in some sense not really “reading” the source code and using it to execute. Models trained on general text corpora may be better at inducing patterns from as few as two input-output examples than they are at predicting the execution of code.
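(If someone actually wanted to run the human version of this test, a minimal harness might look something like the sketch below. The one-second window, the whitespace "tokens," and the grading are all stand-ins I made up, not anything from the paper.)

```python
# Crude harness for the human version of the test: show a program, then collect the
# predicted output one token at a time, allowing ~1 second per token.
# The time limit is checked after the fact rather than enforced, and "tokens" here are
# just whitespace-separated pieces of the expected output.
import time

def run_trial(program: str, expected_output: str, seconds_per_token: float = 1.0) -> bool:
    print(program)
    for i, expected in enumerate(expected_output.split()):
        start = time.monotonic()
        answer = input(f"output token {i + 1}: ").strip()
        if time.monotonic() - start > seconds_per_token or answer != expected:
            return False  # too slow or wrong: the whole prediction counts as a miss
    return True

# Accuracy over a set of (program, expected_output) pairs is then the fraction of
# trials where run_trial(...) returns True.
```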
Second comment: Speculation about scaling trends:
Extrapolating from Figure 3, it seems that an AI which can solve (via at least one sample) approximately 100% of the coding tasks in this set, without even needing fine-tuning, will require +2 OOMs of parameters. That would probably cost about $5B to train once you factor in the extra data required, but also the lower prices and algorithmic improvements since GPT-3. Being almost 2 OOMs bigger than GPT-3, it might be expected to cost $6 per 1000 tokens, which would make it pretty expensive to use (especially at full strength, where it generates multiple samples and then picks the best one). I think it might still find an economic niche, though: you could have a system where a smaller model attempts a solution first, and you only call up the big model if that fails, then keep generating samples until you get one that works. On average the number of samples you'd need would be small, and it would only cost you multiple dollars for the toughest few percent of cases. Such a service could then be used by well-paid programmers for whom the time savings are worth it.

Does this extrapolation/speculation seem right?
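(For concreteness, here is a minimal Python sketch of the cascade I have in mind. `small_model`, `big_model`, `passes_tests`, and the price constant are all hypothetical placeholders; the $6 per 1000 tokens is just my guess above.)

```python
# Hypothetical sketch of the "small model first, big model only on failure" cascade.
# `small_model` and `big_model` are placeholder callables returning (program, tokens_used).

BIG_MODEL_PRICE_PER_1K_TOKENS = 6.00   # guessed price for a model ~2 OOMs bigger than GPT-3
MAX_BIG_MODEL_SAMPLES = 20             # give up after this many attempts

def passes_tests(program: str, tests) -> bool:
    """Stand-in for running the candidate program against the task's test cases."""
    ...

def solve(task, tests, small_model, big_model):
    # Cheap first attempt: the small model handles the easy majority of tasks.
    candidate, _ = small_model(task)
    if passes_tests(candidate, tests):
        return candidate, 0.0  # treat the small model as roughly free by comparison

    # Fallback: keep sampling from the big model until a candidate passes the tests.
    cost = 0.0
    for _ in range(MAX_BIG_MODEL_SAMPLES):
        candidate, tokens_used = big_model(task)
        cost += tokens_used / 1000 * BIG_MODEL_PRICE_PER_1K_TOKENS
        if passes_tests(candidate, tests):
            return candidate, cost
    # Only the toughest few percent of tasks should end up here, having cost several dollars.
    return None, cost
```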
First comment: I don’t think their experiment about code execution is much evidence re “true understanding.”
I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not “understand” the code.
(Idk if you were trying to argue something else with the comparison, but I don’t think it’s clear that this is a reasonable comparison; there are tons of objections you could bring up. For example, humans have to work from pixels whereas the language model gets tokens, making its job much easier.)
Second comment: Speculation about scaling trends:
I didn’t check the numbers, but that seems pretty reasonable. I think there’s a question of whether it actually saves time in the current format—it might be faster to simply write the program than to write down a clear natural language description of what you want along with test cases.
I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not “understand” the code.
Haha, good point—yes. I guess what I should say is: Since humans would have performed just as poorly on this experiment, it doesn't count as evidence that e.g. "current methods are fundamentally limited" or "artificial neural nets can't truly understand concepts in the ways humans can" or "what goes on inside ANNs is fundamentally a different kind of cognition from what goes on inside biological neural nets" or whatnot.
Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I’d guess the authors would agree).