Parameter/synapse count is actually not that important by itself; the first principal component in terms of predictive capability is net training compute. All successful NNs operate in the overcomplete regime, where they have far more circuit capacity than the minimal circuit required to achieve comparable capability on their training set. This is implied by the various scaling-law papers; it’s also why young human children have an OOM more synapses than adults, why you can prune a trained network down by OOMs related to its overcapacity factor, why there are so many DL papers on the “lottery ticket” hypothesis and related ideas, etc.
net_training_compute = synaptic_compute * training_time
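To make that identity concrete, here is a rough back-of-envelope sketch in Python. All of the constants below (parameter count, token count, synapse count, firing rate, developmental window) are illustrative assumptions for the sake of the arithmetic, not figures claimed in this thread:

```python
# Back-of-envelope sketch of the compute identity above.
# Every constant here is an illustrative assumption, not a measurement.

def net_training_compute(compute_per_step, training_steps):
    """Total search volume: per-step circuit compute times steps taken."""
    return compute_per_step * training_steps

# ANN-style estimate: a transformer costs roughly ~6 FLOPs per parameter
# per training token (forward + backward), so per-step compute scales
# with parameter count.
ann_params = 1e11   # assumed: ~100B-parameter model
ann_tokens = 1e12   # assumed: ~1T training tokens
ann_compute = net_training_compute(6 * ann_params, ann_tokens)

# BNN-style estimate: per-second compute scales with synapse count times
# mean firing rate; "training time" is a developmental window in seconds.
bnn_synapses = 1e14  # assumed: ~1e14 synapses
bnn_rate_hz = 1.0    # assumed: ~1 Hz mean spike rate
bnn_seconds = 1e9    # assumed: ~30 years of wall-clock experience

bnn_compute = net_training_compute(bnn_synapses * bnn_rate_hz, bnn_seconds)

print(f"ANN estimate: ~{ann_compute:.0e} ops")  # ~6e+23
print(f"BNN estimate: ~{bnn_compute:.0e} ops")  # ~1e+23
```

Under these (assumed) numbers the two estimates land within an OOM of each other, even though the ANN uses a far smaller circuit trained for far more steps — which is exactly the size/time tradeoff described next.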
What matters is the total circuit-space search volume explored, not the circuit size. You can reach the same volume, and thus the same capability, by training a smaller, more compressed circuit for much longer (as in ANNs), or a larger circuit for less time (as in BNNs).
Only if you’re overcomplete enough to have a winning ticket at init time; with that caveat, agreed. If you don’t have a winning ticket at init, you need something like evolutionary search, which can be drastically less efficient depending on the details of the update rule.
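Since the argument leans on the lottery ticket hypothesis, here is a minimal sketch of the one-shot variant of the Frankle & Carbin (2019) procedure in PyTorch: train a dense overcomplete net, keep the largest-magnitude weights, rewind the survivors to their init values, and retrain the sparse “ticket.” The toy model, random regression data, and 90% pruning fraction are all illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

def train(model, data, targets, steps=1000, lr=1e-2):
    """Plain SGD training loop on a toy regression task."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()

torch.manual_seed(0)
data, targets = torch.randn(256, 32), torch.randn(256, 1)

# Deliberately overcomplete net for a toy task.
model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 1))
init_state = copy.deepcopy(model.state_dict())  # remember the init weights

train(model, data, targets)

# Global magnitude pruning: keep the top-10% of trained weights
# (biases included here purely for brevity).
all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
threshold = all_weights.quantile(0.90)
masks = [(p.detach().abs() >= threshold).float() for p in model.parameters()]

# Rewind the surviving weights to their init values, zero the rest.
model.load_state_dict(init_state)
with torch.no_grad():
    for p, m in zip(model.parameters(), masks):
        p.mul_(m)

# Retrain only the "winning ticket" subnetwork, re-masking after each
# step so the pruned weights stay pinned at zero.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(1000):
    opt.zero_grad()
    loss_fn(model(data), targets).backward()
    opt.step()
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)
```

If the retrained ticket matches the dense net’s loss, the init contained a winning ticket; if the init weren’t overcomplete, no such subnetwork would exist for pruning to find — which is the caveat above.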