So to summarize your claim (check if I’m understanding correctly):
1. Character-level tokenization can lead to different results.
- My answer: Yes and no, but mostly no. H-Test is not just any set of character-manipulation tasks.
- Explanation: Maybe some H-Test tasks can be affected by this. But how do you explain tasks like Repeated Word (one group has two repeated words) or End Punctuation (based on the location of the punctuation)? Though this opinion is valid and probably worth further investigation, it doesn’t disprove the full extent of our tests. Along similar lines, GPT-4 shows some of its largest performance jumps over GPT-3.5 on non-character-level tasks (Repeated Word: 0.505 → 0.98).
2. Scaling up will lead to better results. Since no other models tested were at the scale of GPT-4, that’s why they couldn’t solve H-Test.
- My answer: No, but it would be interesting if this turned out to be true.
- Explanation: We tested 15 models from leading LLM labs before we arrived at our claim. If H-Test were a “scaling task”, we would have observed at least some degree of performance improvement in other models like Luminous or LLaMA too. But no, this was not the case. And the research you linked doesn’t seem to devise a text-to-text setup to test this ability.
3. Memorization (i.e., more orthography-specific data) will lead to better results.
- My answer: No.
- Explanation: Our Section 5 (Analysis: We Don’t Understand GPT-4) is in fact dedicated to disproving the claim that more orthography-specific data will help LLMs solve H-Test. In the GPT-3.5-Turbo finetuning results on the H-Test training set, we observed no significant improvement in performance. Before and after finetuning, performance remains tightly centered around the random-chance baseline (a minimal illustrative sketch of this binary-choice setup follows the list).
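For concreteness, here is a minimal sketch of the kind of two-option setup that a 50% random-chance baseline implies, using a hypothetical “Repeated Word”-style item. The item wording, prompt format, and scoring below are illustrative assumptions, not the actual H-Test harness.

```python
# Illustrative sketch only: a hypothetical "Repeated Word"-style binary item
# and how accuracy would be compared against the 50% random-chance baseline.
# The real H-Test items, prompts, and scoring may differ.
import random


def make_repeated_word_item(rng: random.Random) -> dict:
    """Build one two-option item where exactly one option contains a repeated word."""
    words = ["river", "stone", "lantern", "meadow", "harbor", "violet"]
    base = rng.sample(words, 4)
    clean = " ".join(base)
    dup_pos = rng.randrange(4)
    with_repeat = " ".join(base[:dup_pos + 1] + [base[dup_pos]] + base[dup_pos + 1:])
    options = [clean, with_repeat]
    rng.shuffle(options)
    return {
        "prompt": (
            "Which sentence contains a repeated word?\n"
            f"(A) {options[0]}\n(B) {options[1]}\n"
            "Answer with A or B."
        ),
        "answer": "A" if options[0] == with_repeat else "B",
    }


def accuracy(predict, items) -> float:
    """Score a predict(prompt) -> 'A'/'B' callable; chance level is 0.5."""
    correct = sum(predict(item["prompt"]) == item["answer"] for item in items)
    return correct / len(items)


rng = random.Random(0)
items = [make_repeated_word_item(rng) for _ in range(1000)]
random_guesser = lambda prompt: rng.choice("AB")
print(f"random-chance baseline: {accuracy(random_guesser, items):.3f}")  # ~0.50
```

On items of this kind, a score near 0.5 (like the 0.505 above) is indistinguishable from guessing, while 0.98 is close to ceiling.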
Maybe some H-Test tasks can be affected by this. But how do you explain tasks like Repeated Word (one group has two repeated words) or End Punctuation (based on the location of the punctuation)?
I don’t think I need to. ‘End Punctuation’ sounds like it’s affected by tokenization, and regardless, artificial microbenchmarks like ‘Repeated Word’ are not expected to exhibit smooth scaling the way global losses like perplexity do. (They instead exhibit emergence, inverse U-scaling, and noisy patterns due to combined sampling error & biases from model checkpoints / sizes / test items / test sizes / prompts+formatting.) Look at Big-Bench to see how noisy these sorts of things are even when they are being properly evaluated in controlled conditions and sweeping model sizes (whereas your results are an uninterpretable hodge-podge).
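To put a rough number on the sampling-error point, here is a back-of-the-envelope sketch of how far an observed accuracy on a small binary-choice test can wander even when the model’s true ability is fixed. The test sizes below are illustrative, not H-Test’s actual item counts.

```python
# Illustrative only: binomial standard error of an observed accuracy on a
# binary-choice benchmark with n items, at a true success rate of 0.5.
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed accuracy when the true rate is p over n items."""
    return math.sqrt(p * (1 - p) / n)


for n in (50, 100, 500):
    half_width = 1.96 * binomial_se(0.5, n)  # ~95% interval half-width
    print(f"n={n:4d}: 0.50 +/- {half_width:.3f}")
# n=  50: 0.50 +/- 0.139
# n= 100: 0.50 +/- 0.098
# n= 500: 0.50 +/- 0.044
```

On top of this purely statistical spread come the checkpoint, prompt, and formatting biases mentioned above, which do not wash out the same way.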
Meanwhile, how do you explain the PaLM results on spelling miracles if you don’t believe in scaling and that these are tasks “language models don’t learn”?
We tested 15 models from leading LLM labs before we arrived at our claim. If H-Test were a “scaling task”, we would have observed at least some degree of performance improvement in other models like Luminous or LLaMA too. But no, this was not the case.
We see improvements from scaling all the time that start from a flatline and then increase at critical sizes. See ‘emergence’. Emergence is not that surprising because phase transitions are everywhere in NNs; and obviously, people don’t bother creating benchmarks where all the LLMs already score ~100%, so the best model, GPT-4, is the one with a chance to exhibit emergence. And, doubtless, we’ll see more examples with GPT-5 etc. (You also have a higher opinion of some of these ‘leading’ models, like Luminous, than I think most people do.)
Our Section 5 (Analysis: We Don’t Understand GPT-4) is in fact dedicated to disproving the claim that more orthography-specific data will help LLMs solve H-Test. In the GPT-3.5-Turbo finetuning results on the H-Test training set, we observed no significant improvement in performance. Before and after finetuning, performance remains tightly centered around the random-chance baseline.
Why would finetuning on a training set help a test set if GPT-3.5 is memorizing? Memorizing a pair of rhymes A/B tells you nothing about another pair of rhymes C/D, regardless of the two tasks being ‘in-domain’.
(By the way, I would be skeptical of any conclusions drawn from GPT-3.5 finetuning because even if the ‘finetuning’ seemed to work, who knows what that ‘finetuning’ mystery meat actually is? The first iteration of OA’s GPT-3 finetuning was apparently a fiasco, somehow whenever the rebooted OA GPT-3 finetuning comes up the result from it always seems to be ‘it doesn’t help capabilities’, and OA declines to explain in any detail what the ‘finetuning’ does.)
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Thanks for the comment. I’ll get back to you sometime soon.
Before I come up with anything, though: where are you going with your arguments? It would help me draft a better reply if I knew your ultimate point.
Where am I going? Nowhere complex.
About 1.) That GPT-4’s performance jumps most on non-char tests seems to point towards two sources of difficulty in H-Test, with one being tokenization hiding char-level information.
About 2.) To me your results look completely consistent with scale solving H-Test. There are many benchmarks where a certain scale has to be reached to leave random performance behind. For your benchmark that level is pretty high, but Claude and GPT-4 seem to be above it.
If it’s not scale, what makes Claude and GPT-4 capable of making a dent in your benchmark?
About 3.) Finetuning doesn’t convey enough information to completely revamp the representation of the spelling of different tokens. Finetuning mostly doesn’t teach models skills they don’t have. It instead points them squarely at the task they should be doing.
About 1.) Agree with this duality argument.
About 2.) I’m aware of the type of tasks that suddenly increase in performance at a certain scale, but it is rather challenging to confirm assertions about the emergence of capabilities at specific model scales. If I made a claim like “it seems that emergence happens at a 1T-parameter model size like GPT-4”, it would be misleading, as there are too many confounding variables in play. However, it would also be a false belief to claim that absolutely nothing happens at such an astronomical model size.
Our paper’s stance, phrased carefully (and hopefully firmly), is that larger models from the same family (e.g., LLaMA 2 13B to LLaMA 2 70B) don’t automatically lead to better H-Test performance. In terms of understanding GPT-4’s performance (Analysis: We Don’t Understand GPT-4), we agreed that we should be blunt about not being able to explain why GPT-4 performs so well, given the many confounding variables involved.
As for Claude, we refrained from speculating about scale since we didn’t observe its impact directly. Given the lack of transparency about model sizes from AI labs, and considering that other models in our study performed on par with Claude on benchmarks like MMLU, we can’t attribute Claude’s 60% accuracy solely to scale. Even if we view this accuracy as more than a marginal improvement, it suggests that Claude is doing something distinct, resulting in a greater boost on H-Test than one might expect from scaling effects on other benchmarks.
About 3.) Fine-tuning can indeed be effective at getting models to memorize information. In our study, this approach served as a useful proxy for testing the models’ ability to learn from orthography-specific data, and it did not yield substantial performance improvements on H-Test.
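For reference, a GPT-3.5-Turbo fine-tuning run of the kind described would look roughly like the sketch below using the OpenAI Python SDK. The file name and example format are assumptions about what an H-Test-style training set could look like, not the authors’ actual setup.

```python
# Hypothetical sketch of fine-tuning gpt-3.5-turbo on an H-Test-style training
# set via the OpenAI API; file name and example format are assumptions.
from openai import OpenAI

client = OpenAI()

# "htest_train.jsonl" would hold one chat-formatted example per line, e.g.:
# {"messages": [{"role": "user", "content": "Which sentence contains a repeated word? ..."},
#               {"role": "assistant", "content": "B"}]}
train_file = client.files.create(
    file=open("htest_train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
)

# Once the job reports "succeeded", evaluate job.fine_tuned_model on held-out
# items exactly as the base model was evaluated, and compare both against the
# 0.5 random-chance baseline.
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)
```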