There can be zero doubt that GPT-4 is better than GPT-3–but also I cannot imagine how we are supposed to achieve ethical and safety “alignment” with a system that cannot understand the word “third” even with billions of training examples.
He is referring to the test questions about third words, third letters, and so on. I think in that case the errors stem from GPT-4’s weakness with low-level properties of character strings, not from its weakness with numbers. If you ask it “What is the third digit of the third three-digit prime?” it will answer correctly (ChatGPT won’t).
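For reference, the expected answer to that prime question is easy to check by brute force. A minimal sketch (the helper names `is_prime` and `three_digit_primes` are my own, not anything from the thread):

```python
def is_prime(n: int) -> bool:
    """Trial division; plenty fast for three-digit numbers."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def three_digit_primes() -> list:
    """All primes from 100 to 999, in increasing order."""
    return [n for n in range(100, 1000) if is_prime(n)]


primes = three_digit_primes()
third_prime = primes[2]            # ordinals are 1-based, list indices are 0-based
third_digit = str(third_prime)[2]  # third character of the decimal string
print(third_prime, third_digit)    # 107 7
```

The three-digit primes start 101, 103, 107, so the third one is 107 and its third digit is 7.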
The word-count domain is an odd one because the clear whitespace separation means that it doesn’t look like it should be a BPE artifact, which is my go-to explanation for these sorts of things. My best guess at present is that it may be a sparsity artifact which manifests here because there are too few natural instances of such references to train the low-level layers to automatically preserve ordinal/count metadata about individual words up enough levels that their relevance becomes clear.
GPT-4 and ChatGPT seem to be getting gradually better at working at the letter level in some cases. For example, they can now pick out the n-th word or letter in a sentence, but not when counting from the end.
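To be precise about what is being asked of the model, here is a small reference sketch of both tasks, counting from the start or from the end. The function names and the example sentence are my own, and "word" is assumed to mean a whitespace-separated chunk (so punctuation stays attached):

```python
def nth_word(sentence: str, n: int, from_end: bool = False) -> str:
    """Return the n-th whitespace-separated word, counting from 1."""
    words = sentence.split()
    if not 1 <= n <= len(words):
        raise ValueError(f"sentence has {len(words)} words, asked for word {n}")
    return words[-n] if from_end else words[n - 1]


def nth_letter(text: str, n: int, from_end: bool = False) -> str:
    """Return the n-th alphabetic character, counting from 1."""
    letters = [c for c in text if c.isalpha()]
    if not 1 <= n <= len(letters):
        raise ValueError(f"text has {len(letters)} letters, asked for letter {n}")
    return letters[-n] if from_end else letters[n - 1]


sentence = "The quick brown fox jumps over the lazy dog"
print(nth_word(sentence, 3))                  # brown
print(nth_word(sentence, 3, from_end=True))   # the
print(nth_letter("brown", 2))                 # r
print(nth_letter("brown", 2, from_end=True))  # w
```

For a program the two directions are symmetric; the observation above is that for the models they apparently are not.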
This was my impression too, and I’m glad someone else said it. When I retry past examples (from a week ago) of ChatGPT getting things wrong, I very often find that it gets them right now. Of course, annoyingly, people often report on ChatGPT-4 capabilities when they actually tried ChatGPT-3.5, but still, I feel like it has improved. Is it a crazy possibility that OpenAI keeps training GPT-4 and periodically swaps out the deployed model? As far as I can tell, the only source stating that GPT-5 is in training is the Morgan Stanley report, but what if it is actually not GPT-5 but rather a continually trained GPT-4 that is running on those GPUs?
Relatedly: is “reverse distillation” (i.e., generating a model with more parameters from a smaller one) possible for these big transformer models? (I guess you can always stack more layers at the end, but surely that simple method has some downsides.) It would be useful to stay on the scaling curves without restarting from scratch with a larger model.
Relatedly: is “reverse distillation” (ie, generating a model with more parameters from a smaller one) possible for these big transformer models?
Yes. This fits under a couple of terms: hot-starting, warm initialization with model surgery a la OA5, slow weights vs fast weights / meta-learning, tied weights, etc. It’s also a fairly common idea in Neural Architecture Search, where you try to learn a small ‘cell’ or ‘module’ (either just the architecture or the weights as well) cheaply and then stack a bunch of them to get your final SOTA model, and the two can be combined, e.g. SMASH. An example of using this to train very large models is “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Lin et al 2021. It seems appealing but raises questions about efficiency & bias: are you really still on the same scaling curve as the ‘true’ large model, given that the smaller model you are training almost by definition has a different (worse) scaling curve? And might you not be sabotaging your final model by hardwiring the weaknesses of the small initial model into it, rendering the approach penny-wise and pound-foolish?
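As a concrete illustration of the simplest version of the "stack more layers at the end" idea, here is a toy sketch in pure Python. It is not the method used by OA5, SMASH, or M6-10T; it only assumes residual blocks of the form `x + W @ relu(x)` and initializes each new block's weights to zero, so the grown model starts out computing exactly the same function as the small one. All names (`residual_block`, `grow_depth`, etc.) are hypothetical.

```python
import random


def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(W, x):
    """One toy residual block: x + W @ relu(x)."""
    return [xi + yi for xi, yi in zip(x, matvec(W, relu(x)))]


def forward(layers, x):
    """Apply the residual blocks in order."""
    for W in layers:
        x = residual_block(W, x)
    return x


def random_matrix(dim, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def zero_matrix(dim):
    return [[0.0] * dim for _ in range(dim)]


def grow_depth(layers, extra):
    """Copy the small model's blocks and append `extra` zero-initialized blocks.

    A zero block computes x + 0 = x, so the output is unchanged at initialization.
    """
    dim = len(layers[0])
    copied = [[row[:] for row in W] for W in layers]
    return copied + [zero_matrix(dim) for _ in range(extra)]


rng = random.Random(0)
small = [random_matrix(4, rng) for _ in range(2)]  # stands in for a trained small model
large = grow_depth(small, extra=2)                 # twice as many parameters

x = [0.5, -1.0, 2.0, 0.0]
assert forward(small, x) == forward(large, x)      # same function before further training
```

This only shows that the larger model can inherit the smaller one's behavior at the start of training. It says nothing about the question raised above, namely whether continued training from such a start stays on the scaling curve of a large model trained from scratch.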
I always get annoyed when people use this as an example of ‘lacking intelligence’. Though it certainly is in part an issue with the model, the primary reason for this failure is much more likely the tokenization process than anything else. A GPT-4, likely even a GPT-3, trained with character-level tokenization would likely have zero issues answering these questions. It’s for the same reason that the base GPT-3 struggled so much with rhyming for instance.
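To make the tokenization point concrete, here is a toy sketch. It uses greedy longest-match splitting as a stand-in for real BPE (which applies learned merges instead), and the vocabulary is made up for illustration; it is not GPT-3's or GPT-4's actual vocabulary.

```python
# Hypothetical toy vocabulary: a few multi-character pieces plus single letters.
TOY_VOCAB = ["straw", "berry", "th", "ird",
             "s", "t", "r", "a", "w", "b", "e", "y", "h", "i", "d"]


def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary piece that matches."""
    by_len = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for piece in by_len:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # Character not covered by the vocabulary: keep it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


subword_tokens = greedy_tokenize("third", TOY_VOCAB)
subword_ids = [TOY_VOCAB.index(t) for t in subword_tokens]
char_tokens = list("third")

print(subword_tokens)  # ['th', 'ird']
print(subword_ids)     # [2, 3]
print(char_tokens)     # ['t', 'h', 'i', 'r', 'd']
```

With subword tokens the model receives something like the ID sequence `[2, 3]` and has to have memorized which letters those IDs contain. With character-level tokens, "the third letter" is simply the third element of the input.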
Interesting tweet from Marcus 2 days ago:
Independently of the root causes of the issue, I am still very reluctant to call something “superintelligent” when it cannot reliably count to three.
What is interesting about this tweet? That Marcus turns to the alignment problem?
I’m confused. Here’s a conversation I just had with GPT-4, with prompts in italics:
This part is indeed wrong. The third word of that sentence is “the”, not “third” as GPT4 claims.
That was arguably the hardest task, because it involved multi-step reasoning. Notably, I didn’t even notice that GPT-4’s response was wrong.