I don’t think any of the AG-related papers specify the disk size of the model; they may specify the total number of parameters somewhere, but if so, I don’t recall it offhand. It should be possible to estimate it from the described model architecture by multiplying out all of the convolutions by strides/channels/etc., but that’s pretty tricky and easy to get wrong.
I once loosely estimated, using the architecture on R.J. Lipton’s blog when he asked the same question, that the AZ model is probably somewhere around 300MB. So, large but not unusually so.
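For anyone who wants to redo that kind of back-of-the-envelope estimate, here is a minimal sketch of the "multiply out the convolutions" arithmetic. Everything specific in it is an assumption for illustration rather than a figure taken from the papers' text here: a 19x19 board, 17 input planes, 256 filters, 3x3 kernels, 20 or 40 residual blocks of two conv layers each, small policy/value heads, and 4 bytes per parameter. The function names are hypothetical. It also ignores optimizer state and file-format overhead, so treat the result as an order-of-magnitude check only.

```python
def conv_params(in_ch, out_ch, kernel, bias=True):
    """Weights (plus optional biases) of one square 2D convolution."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def batchnorm_params(ch):
    """Learned scale and shift per channel (running statistics not counted)."""
    return 2 * ch


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def estimate_params(board=19, in_planes=17, filters=256, blocks=40,
                    moves=362, value_hidden=256):
    """Rough parameter count for an assumed AlphaZero-style residual tower."""
    squares = board * board
    # Input convolution ("stem").
    total = conv_params(in_planes, filters, 3) + batchnorm_params(filters)
    # Residual tower: each block is assumed to hold two 3x3 conv + batchnorm layers.
    per_block = 2 * (conv_params(filters, filters, 3) + batchnorm_params(filters))
    total += blocks * per_block
    # Policy head (assumed): 1x1 conv to 2 planes, then a dense layer over all moves.
    total += conv_params(filters, 2, 1) + batchnorm_params(2)
    total += dense_params(2 * squares, moves)
    # Value head (assumed): 1x1 conv to 1 plane, a hidden dense layer, a scalar output.
    total += conv_params(filters, 1, 1) + batchnorm_params(1)
    total += dense_params(squares, value_hidden) + dense_params(value_hidden, 1)
    return total


def size_mb(params, bytes_per_param=4):
    """Disk size in MB if every parameter is stored at a fixed width."""
    return params * bytes_per_param / 1e6


if __name__ == "__main__":
    for blocks in (20, 40):
        n = estimate_params(blocks=blocks)
        print(f"{blocks} blocks: {n:,} params, ~{size_mb(n):.0f} MB at float32")
```

Under these assumptions the count lands in the tens of millions of parameters, so somewhere in the low hundreds of MB at float32: the same order of magnitude as the ~300MB guess. Almost all of it comes from the 3x3 convolutions in the residual tower, which is why the block count and filter width dominate the answer.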
However, as I point out, if you are interested in interpreting that in an information-theoretic sense, you have to ask whether model compression/distillation/sparsification is relevant. The question of why NNs are so overparameterized, aside from being extremely important to AI risk and the hardware overhang question, is a pretty interesting one. There is an enormous literature (some of which I link here) showing an extreme range of size decreases/speed increases, with 10x being common and 100x not impossible, depending on details like how much accuracy you want to give up. (For AZ, you could probably get 10x with no visible impact on Elo, but if you were willing to search another ply or two at runtime, perhaps you could get another order of magnitude? It’s a tradeoff: the bigger the model, the higher the value function accuracy & the less search it needs to achieve a target Elo strength.)
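To make the 10x/100x figures concrete, here is a toy calculation of how quantization and pruning multiply together. This is only a sketch of the storage arithmetic, not a claim about what any real compression method achieves on AZ: the function names are hypothetical, the bit widths and densities are made-up illustrative inputs, and it ignores the index overhead that real sparse formats pay.

```python
def compression_ratio(bits=32, density=1.0, dense_bits=32):
    """Size reduction from storing `density` of the weights at `bits` each.

    Assumes the baseline stores every weight at `dense_bits`, and ignores
    the bookkeeping (indices, codebooks) a real sparse format would need.
    """
    return dense_bits / (bits * density)


def compressed_size_mb(dense_mb, bits=32, density=1.0, dense_bits=32):
    """Compressed size of a model whose dense size is `dense_mb`."""
    return dense_mb / compression_ratio(bits, density, dense_bits)


if __name__ == "__main__":
    dense_mb = 300  # the rough AZ guess from the text
    # Illustrative settings only: (bits per weight, fraction of weights kept).
    for bits, density in [(32, 1.0), (8, 1.0), (8, 0.4), (4, 0.08)]:
        ratio = compression_ratio(bits, density)
        size = compressed_size_mb(dense_mb, bits, density)
        print(f"{bits}-bit, {density:.0%} kept: ~{ratio:.0f}x, ~{size:.0f} MB")
```

The point is just that modest individual savings compound: 8-bit weights with 40% of them kept is already about 10x, and getting to 100x needs something like 4-bit weights with well under a tenth of them kept, which is where the accuracy tradeoff starts to bite.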
But is that fair? After all, you can’t learn that small neural network in the first place except by first passing through the very large one (as far as anyone knows). Similarly, with DNA, you have enormous ranges of genome sizes for no good apparent reason, even among closely related species, and viruses demonstrate that you can get absurd compression out of DNA by overlapping genes or reading them backwards (among other insane tricks). But such minified genomes may be quite fragile, and junk DNA and chromosomal or whole-genome duplications often lead to big genetic changes and adaptations and speciations, so all that fat may be serving evolvability or robustness purposes. Like NNs, maybe you can only get that hyper-specialized efficient genome after passing through a much larger overparameterized genome. (Viruses, then, may get away with such tiny genomes by optimizing for relatively narrow tasks, applying extraordinary replication & mutation rates, and outsourcing as much as they can to regular cells or other viruses or other copies of themselves, like ‘multipartite viruses’. And even then, some viruses have huge genomes.) https://slatestarcodex.com/2020/05/12/studies-on-slack/ and https://www.gwern.net/Backstop and https://www.gwern.net/Hydrocephalus might be relevant reading here.