For networks using sparsity, I wonder if you could argue that a well-optimized sparse network will also approach a 3:1 ratio from above as throughput/training is optimized? Human brains may not activate every area equally frequently on average, but we can’t choose to pseudo-experience arbitrary subsets of data to train on, either.
Consider a MoE: a well-balanced MoE could optimize each expert in parallel efficiently, as the gradients will not interfere with each other; they learn different things about different subsets of the data. If you let a bunch of experts sit idle, unused, not being dispatched any work, then you are letting GPUs sit idle, and you are probably putting too much work onto the ‘hotspot’ experts, letting them bottleneck training. The hotspots should be broken up into additional experts, which can then run/train separately. Data points should be picked based on activating underused experts: if a datapoint is frequent, then it is probably already as ‘solved’ as it’ll get and experiencing severe diminishing returns, and you will get more value from training on a rarer datapoint whose expert is still learning. So, as you increasingly scale up your MoE and dataset, your GPUs will all be equally busy computing their experts in parallel: ‘common’ tasks will be split up (e.g. weights copied into two new experts which then compete over the data the original used to be assigned) until they are rebalanced, and ‘common’ data will be undersampled, with the slot given to oversampling ‘rare’ data. And then each expert is internally just a dense model to which the 3:1 applies.
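Here is a minimal toy sketch of the two rebalancing moves described above: splitting ‘hotspot’ experts into competing copies, and oversampling datapoints that route to underused experts. All names and thresholds are hypothetical, not any production MoE codebase.

```python
import numpy as np

# Toy sketch (hypothetical, not a real MoE implementation) of the rebalancing
# heuristic above: experts receiving too much traffic are split into copies
# that compete for the same data, and sampling is tilted toward datapoints
# whose experts are still underused.

rng = np.random.default_rng(0)

def sampling_weights(expert_of_datapoint, dispatch_counts):
    """Oversample datapoints that route to underused ('rare') experts."""
    load = np.asarray([dispatch_counts[e] for e in expert_of_datapoint], dtype=float)
    w = 1.0 / (load + 1e-9)              # lightly loaded expert -> higher weight
    return w / w.sum()

def split_hot_experts(experts, dispatch_counts, hot_factor=2.0):
    """Split any expert whose load exceeds `hot_factor` times the mean load."""
    mean_load = np.mean(dispatch_counts)
    new_experts, new_counts = [], []
    for w, count in zip(experts, dispatch_counts):
        if count > hot_factor * mean_load:
            # Copy the weights (plus a little noise so the copies diverge)
            # and let the two copies compete over the original traffic.
            new_experts += [w.copy(), w + 0.01 * rng.standard_normal(w.shape)]
            new_counts += [count / 2, count / 2]
        else:
            new_experts.append(w)
            new_counts.append(count)
    return new_experts, new_counts

# Usage: 4 small experts, one of which is a clear hotspot.
experts = [rng.standard_normal((8, 8)) for _ in range(4)]
dispatch_counts = [1000, 120, 90, 60]                 # expert 0 is the hotspot
expert_of_datapoint = rng.integers(0, 4, size=20)     # routing of 20 datapoints

probs = sampling_weights(expert_of_datapoint, dispatch_counts)
experts, dispatch_counts = split_hot_experts(experts, dispatch_counts)
print(f"{len(experts)} experts after splitting; rarest datapoint prob {probs.max():.3f}")
```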
For networks using sparsity, I wonder if you could argue that a well-optimized sparse network will also approach a 3:1 ratio from above as throughput/training is optimized?
“Systems exploiting sparsity” opens the Pandora’s box of more complex algorithms that can spend differentially, with arbitrary flexibility, on forward vs backward updates. For example, standard SGD with its 3:1 rule is a fairly specific and somewhat arbitrary (but sensible) Schelling point in the vast space of approximate Bayesian backprop algorithms. There are some that spend a bit more compute in the activation/sparsity step of the forward pass to find better sparse approximations (for compression, downstream compute savings, and improved orthogonality/curvature for faster convergence), and then exploit that known activation sparsity more in the backward pass. And/or others that spend on more general inversion inference in the backward pass, which can jump to new configurations in the energy landscape for faster inference/learning rather than making tiny incremental gradient steps. And then there are algorithms that track variance/precision dynamically across swaths of parameter space and decide where and when to invest in updating, avoiding spending energy on parameters that already have sufficiently high precision and have little to gain from the current evidence update. 3:1 is clearly not some fundamental optimal ratio from physics.
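As a toy illustration of that last point, here is a sketch of tracking per-parameter precision and skipping updates where there is little to gain, so backward/update compute is no longer a fixed multiple of forward compute. The gating rule here is made up for illustration, not any specific published algorithm.

```python
import numpy as np

# Toy sketch: keep a per-parameter precision (inverse variance) estimate and
# only spend update compute on parameters that are still uncertain.
# Purely illustrative thresholds and update rule.

rng = np.random.default_rng(1)

theta = rng.standard_normal(10_000)      # parameters
precision = np.ones_like(theta)          # running precision estimates

def precision_gated_update(theta, precision, grad, lr=0.1, enough_precision=4.0):
    active = precision < enough_precision        # skip already-precise parameters
    theta[active] -= lr * grad[active] / precision[active]
    precision[active] += grad[active] ** 2       # evidence accumulates where we updated
    return active.mean()                         # fraction of parameters touched

for step in range(8):
    grad = rng.standard_normal(theta.shape)      # stand-in for a real gradient
    frac = precision_gated_update(theta, precision, grad)
    print(f"step {step}: updated {frac:.1%} of parameters")
```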
Human brains may not activate every area equally frequently on average, but we can’t choose to pseudo-experience arbitrary subsets of data to train on, either.
Hmm, I’d argue we sort of can: daydreaming, imagining, memory, hippocampal replay during sleep are all forms of active learning, picking valuable episodes (training data subsets) to pseudo-experience. And the pseudo-experience really does look very similar to experience, region by region, in terms of neural activity. The same goes for imitation via imagination: when watching someone do some activity, the brain can translate that into an imaginary experience with neural activity similar to doing it, and try to learn on that.
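For an ML-side analogy (a loose sketch only, not a claim about brain mechanisms), prioritized experience replay does something structurally similar: it ‘pseudo-experiences’ the stored episodes judged most valuable more often. The names and surprise scores below are purely illustrative.

```python
import numpy as np

# Loose analogy: replay stored episodes, favoring the ones with the most
# left to teach (here, a made-up "surprise" score).

rng = np.random.default_rng(2)

episodes = [f"episode_{i}" for i in range(10)]
surprise = rng.random(10)        # stand-in for how much each episode still teaches

def sample_replay(episodes, surprise, k=3, temperature=0.5):
    """Pick k episodes to replay, favoring high-surprise ones."""
    p = np.exp(surprise / temperature)
    p /= p.sum()
    return list(rng.choice(episodes, size=k, replace=False, p=p))

print(sample_replay(episodes, surprise))
```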
Consider a MoE:
I think your analysis/decomposition technique here is interesting, but even putting aside the potential MoE-specificity, it’s assuming that first-order backprop is the only game in town, and that it is always optimal to update on all the same paths that were active in the forward pass. I’d just summarize it as “In an MoE system where each expert is a dense model using some standard first-order backprop such that 3:1 applies, then even with arbitrarily fancy algorithms to decide a sparse active subset of experts, the whole MoE system will also be 3:1.” Sure.
MoEs aren’t so interesting scaling-wise, as they don’t take advantage of deep factoring. They can make sense for very high-level modules with truly separate function/data domains, such that you don’t expect much overlap, but that’s a very limited gain, and most of the benefits of generalization come from exploiting all the deep commonality.
Regardless of architecture, at the end of the day the dominant costs are all per connection:
- one flop per connection (not param) in the forward pass
- one flop per connection in the back gradient pass (symmetric inverse of forward)
- one flop per connection in the weight gradient calc (symmetry of the gradient of multiplication)
So it should always be 3:1 in the limit, at least for dense networks. For systems exploiting sparsity it’s much more complex.
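As a back-of-the-envelope check, here is the per-connection accounting above written out for a single dense layer, under the same simplification of one flop per connection per pass (the helper name is just illustrative):

```python
# Per-connection accounting for a dense layer with n_in * n_out connections,
# assuming (as the list above does) one FLOP per connection per pass.

def per_connection_flops(n_in: int, n_out: int) -> dict:
    connections = n_in * n_out
    return {
        "forward":      connections,  # y = W x
        "grad_inputs":  connections,  # dL/dx = W^T dL/dy  (back gradient pass)
        "grad_weights": connections,  # dL/dW = dL/dy x^T  (weight gradient)
    }

costs = per_connection_flops(1024, 1024)
total = sum(costs.values())
print(f"training : forward-only = {total / costs['forward']:.0f} : 1")  # 3 : 1
```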
That should be 2:1, not 3:1 (2 FLOPs per connection for the backward pass to 1 FLOP per connection for the forward pass).
And that is basically right, except for the caveats we point out in the post.
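Spelling out the parenthetical under the same one-flop-per-connection-per-pass accounting used above: the backward pass costs one flop per connection for the activation gradient and one for the weight gradient, against one for the forward pass.

```latex
% Same per-connection accounting as above: one flop per connection per pass.
\begin{align*}
  \text{forward}  &= 1~\text{FLOP/connection}\\
  \text{backward} &= \underbrace{1}_{\text{activation gradient}}
                   + \underbrace{1}_{\text{weight gradient}}
                   = 2~\text{FLOPs/connection}\\
  \frac{\text{backward}}{\text{forward}} &= 2:1,
  \qquad
  \frac{\text{forward}+\text{backward}}{\text{forward}} = 3:1.
\end{align*}
```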