Leela Zero uses MCTS; it doesn't play superhuman Go in one forward pass (the way GPT-4 can in some subdomains) (I think; I didn't find any evaluations of Leela Zero at one forward pass), and I'd guess that the network itself doesn't contain any more generalized game-playing circuitry than an LLM; it just has good intuitions for Go.
Nit:
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute.
1.5 to 2 OOMs? 7B to 70B is 1 OOM of compute; adding in Chinchilla efficiency would make it something like 1.5 OOMs of effective compute, not 2. And LLaMA-70B to GPT-4 is 1 OOM of effective compute going by OpenAI's naming (LLaMA-70B is about as good as GPT-3.5). And I'd personally guess GPT-4 is 1.5 OOMs of effective compute above LLaMA-70B, not 2.
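For concreteness, here's a rough back-of-envelope behind the "1 OOM of compute" figure (a sketch, not from the thread), assuming the standard C ≈ 6·N·D approximation for dense-transformer training FLOPs and that both models see the same 2T tokens, as with LLaMA-2:

```python
# Back-of-envelope only: C ≈ 6 * N * D is a rough standard approximation,
# and "same 2T tokens for both sizes" mirrors how LLaMA-2 was trained.
import math

def train_flops(params, tokens):
    """Approximate dense-transformer pre-training FLOPs."""
    return 6 * params * tokens

tokens = 2e12  # 2T tokens for every LLaMA-2 size
c_7b = train_flops(7e9, tokens)    # ~8.4e22 FLOPs
c_70b = train_flops(70e9, tokens)  # ~8.4e23 FLOPs
print(f"7B -> 70B at fixed data: {math.log10(c_70b / c_7b):.1f} OOMs")  # 1.0
```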
Leela Zero uses MCTS; it doesn't play superhuman Go in one forward pass
Good catch: since the context from LLMs is performance in one forward pass, the claim should be about that, and I'm not sure Leela Zero is superhuman without MCTS. I think the intended point survives this mistake, namely that it's a much smaller model than modern LLMs with relatively very impressive performance, primarily because of the high quality of the synthetic dataset it effectively trains on. Thus models at the scale of near-future LLMs will have a reality-warping amount of dataset quality overhang. This makes the ability of LLMs to improve datasets much more impactful than their competence at other tasks, hence the anchors of capability I was pointing out were about labeling and rearranging data according to instructions. It also makes compute-threshold-gated regulation potentially toothless.
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute.
1.5 to 2 OOMs? 7B to 70B is 1 OOM of compute; adding in Chinchilla efficiency would make it something like 1.5 OOMs of effective compute, not 2.
With Chinchilla scaling, compute goes as the square of model size, so 2 OOMs under that assumption. Of course current 7B models are overtrained compared to Chinchilla (all sizes of LLaMA-2 are trained on 2T tokens), which might be your point. And Mistral-7B is less obviously a whole step below LLaMA-2-70B, so the full-step-of-improvement claim should be about earlier 7B models that are more representative of how the frontier of scaling advances, where a Chinchilla-like tradeoff won't yet have completely broken down, probably preserving the data-squared compute scaling estimate (parameter count no longer works very well as an anchor, with all the MoE and sparse pre-training techniques). It's not clear what assumptions make it 1.5 OOMs instead of either 1 or 2; possibly the Chinchilla-inefficiency of overtraining?
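A minimal sketch of the 2-OOMs version, assuming the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (an assumption, so that compute scales as the square of model size):

```python
# Sketch under Chinchilla-optimal scaling: D ≈ 20 * N (rule-of-thumb value),
# so C ≈ 6 * N * D grows with the square of the parameter count.
import math

def chinchilla_flops(params, tokens_per_param=20):
    """Approximate Chinchilla-optimal pre-training FLOPs for a given size."""
    return 6 * params * (tokens_per_param * params)

ratio = chinchilla_flops(70e9) / chinchilla_flops(7e9)
print(f"7B -> 70B, Chinchilla-optimal: {math.log10(ratio):.1f} OOMs")  # 2.0
```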
And LLaMA-70B to GPT-4 is 1 OOM of effective compute going by OpenAI's naming (LLaMA-70B is about as good as GPT-3.5).
I was going from the EpochAI estimates that put LLaMA-2 at 8e23 FLOPs and GPT-4 at 2e25 FLOPs, which is 1.4 OOMs. I'm thinking of effective compute in terms of the compute necessary for achieving the same pre-training loss (using a lower amount of literal compute thanks to pre-training algorithmic improvement), not in terms of meaningful benchmarks for fine-tunes. In this sense overtrained smaller LLaMAs get even less effective compute than literal compute, since they employ it to reach their loss Chinchilla-inefficiently. We can then ask how much subjective improvement a given amount of pre-training loss scaling (in terms of effective compute) gets us. It's not that useful in detail, but it gives an anchor for improvement from scale alone in the coming years, before industry and economy force a slowdown (absent AGI): it goes beyond GPT-4 about as far as GPT-4 is beyond LLaMA-2-13B.
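The 1.4 OOMs figure is just the log-ratio of those two estimates:

```python
import math

llama2_flops = 8e23  # EpochAI estimate cited above
gpt4_flops = 2e25    # EpochAI estimate cited above
print(f"{math.log10(gpt4_flops / llama2_flops):.1f} OOMs")  # ~1.4
```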
IIRC, original AlphaGo had a policy network that was grandmaster level but not superhuman without MCTS.
This is not quite true. Raw policy networks of AlphaGo-like models are often at a level around 3 dan in amateur rankings, which would qualify as a good amateur player but nowhere near the equivalent of grandmaster level. If you match percentiles in the rating distributions, 3d in Go is perhaps about as strong as an 1800 Elo player in chess, while “master level” is at least 2200 Elo and “grandmaster level” starts at 2500 Elo.
Edit: It seems like policy networks have improved since I last checked these rankings, and the biggest networks currently available for public use can possibly achieve a strength as high as 6d without MCTS. That would be somewhat weaker than a professional player, but not by much. Still far off from “grandmaster level”, though.
According to figure 6b in “Mastering the Game of Go without Human Knowledge”, the raw policy network has an Elo of 3055, which according to this other page (I have not checked that these Elo ratings are comparable) makes it the 465th best player. (I don’t know much about this and so might be getting the inferences wrong; hopefully the facts are useful.)
I don’t think those ratings are comparable. On the other hand, my estimate of 3d was apparently lowballing it based on some older policy networks, and newer ones are perhaps as strong as 4d to 6d, which on the upper end is still weaker than professional players but not by much.
However, there is a big gap between weak professional players and “grandmaster level”, and I don’t think the raw policy network of AlphaGo could play competitively against a grandmaster level Go player.
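As a rough illustration of what gaps like these mean under the Elo model (using the standard expected-score formula and the chess-scale numbers from earlier in this thread; as noted above, Go and chess ratings aren't directly comparable):

```python
# Standard Elo expected-score formula; the 1800/2200/2500 figures are the
# chess-scale anchors mentioned in this thread, used purely for illustration.
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

print(f"1800 vs 2500: {expected_score(1800, 2500):.2f}")  # ~0.02
print(f"2200 vs 2500: {expected_score(2200, 2500):.2f}")  # ~0.15
```

Even the master-to-grandmaster step (2200 vs 2500) leaves the weaker player with only about a 15% expected score, which is the kind of gap the comment above is pointing at.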