For networks using sparsity, I wonder if you could argue that a well-optimized sparse network will also approach a 3:1 ratio from above as throughput/training is optimized?
“Systems exploiting sparsity” opens the Pandora’s box of more complex algorithms that can spend compute differentially, with arbitrary flexibility, on the forward pass vs the backward update. Standard SGD with its 3:1 rule is a somewhat arbitrary (but sensible) Schelling point in the vast space of approximate Bayesian backprop algorithms. Some of these spend a bit more compute in the activation/sparsity step of the forward pass to find better sparse approximations (for compression, downstream compute savings, and improved orthogonality/curvature for faster convergence), and then exploit that known activation sparsity more heavily in the backward pass. Others spend on more general inversion inference in the backward pass, which can jump to new configurations in the energy landscape for faster inference/learning rather than making tiny incremental gradient steps. Still others track variance/precision dynamically across swaths of parameter space and decide where and when to invest in updating, avoiding spending energy on parameters that already have sufficiently high precision and have little to gain from the current evidence update. 3:1 is clearly not some fundamental optimal ratio from physics.
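To make that last point concrete, here is a minimal sketch of a precision-gated update rule. This isn't any particular published algorithm, just an illustration of the idea: each parameter carries a running precision estimate, and the backward pass only spends update work on parameters whose precision is still low. Names like `precision` and `update_threshold` are illustrative assumptions, not anything from the discussion above.

```python
import numpy as np

def precision_gated_update(params, grads, precision, lr=1e-2, update_threshold=10.0):
    """Illustrative sketch: skip updates for parameters whose estimated
    precision (inverse variance) is already high, spending update compute
    only where the current evidence can still move the posterior.

    params, grads, precision: arrays of the same shape.
    Returns updated params, updated precision, and the fraction updated.
    """
    # Crude proxy for the evidence precision contributed by this batch
    # (in the spirit of diagonal Fisher / Adam-style second-moment estimates).
    evidence_precision = grads ** 2

    # Only touch parameters whose accumulated precision is still low;
    # high-precision parameters are left alone, saving backward-pass work.
    mask = precision < update_threshold

    # Precision-weighted step: better-known parameters move less.
    params = np.where(mask, params - lr * grads / (1.0 + precision), params)

    # Accumulate precision only for the parameters we actually updated.
    precision = np.where(mask, precision + evidence_precision, precision)

    return params, precision, mask.mean()
```

In a scheme like this the effective backward:forward compute ratio isn't pinned anywhere near 3:1; it shrinks as more of the parameter space saturates in precision.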
Human brains may not activate every area equally frequently on average, but we can’t choose to pseudo-experience arbitrary subsets of data to train on, either.
Hmm, I’d argue we sort of can: daydreaming, imagining, remembering, and hippocampal replay during sleep are all forms of active learning that pick out valuable episodes (training data subsets) to pseudo-experience. And the pseudo-experience really does look very similar to experience, region by region, in terms of neural activity. Something similar happens with imitative imagination: when watching someone perform an activity, the brain can translate that into an imaginary experience with neural activity similar to actually doing it, and learn on that.
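The ML analogue of this kind of selective pseudo-experience is something like prioritized experience replay: keep a buffer of past episodes and re-sample the ones you expect to learn the most from, rather than re-experiencing everything uniformly. A minimal sketch, assuming some externally supplied priority score (e.g. recent prediction error); the names are illustrative, not a claim about how the brain implements replay:

```python
import numpy as np

def sample_replay_batch(episodes, priorities, batch_size=8, temperature=1.0):
    """Illustrative prioritized-replay sketch: sample past episodes to
    'pseudo-experience' again, weighted by how much we expect to learn
    from them, rather than uniformly.

    episodes:   list of stored episodes (any objects).
    priorities: non-negative priority scores, one per episode.
    """
    # Turn priorities into a sampling distribution; temperature controls
    # how strongly we favor high-priority episodes over uniform replay.
    scaled = (np.asarray(priorities, dtype=float) + 1e-8) ** (1.0 / temperature)
    probs = scaled / scaled.sum()

    # Draw a batch of episode indices (with replacement, as in standard
    # prioritized replay) and return the corresponding episodes.
    idx = np.random.choice(len(episodes), size=batch_size, p=probs)
    return [episodes[i] for i in idx]
```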
Consider a MoE:
I think your decomposition technique here is interesting, but even putting aside the potential MoE specificity, it assumes that first-order backprop is the only game in town and that it is always optimal to update along exactly the same paths that were active in the forward pass. I’d summarize it as: “In an MoE system where each expert is a dense model trained with standard first-order backprop, such that 3:1 applies per expert, then even with arbitrarily fancy algorithms for choosing the sparse active subset of experts, the whole MoE system will also be 3:1.” Sure.
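That summary is just the observation that a fixed per-expert ratio survives any sparse routing: total training cost is the sum over active experts of (forward + backward), and if each term is 3× that expert's forward cost, the sum is 3× the total forward cost. A trivial sketch with made-up per-expert FLOP counts:

```python
def training_to_forward_ratio(active_expert_forward_flops, backward_multiplier=2.0):
    """Sketch of the summarized claim: if every active expert pays the same
    backward:forward multiplier (backward ~= 2x forward, so a training step
    ~= 3x forward), the ratio is preserved no matter which sparse subset of
    experts the router picks or how clever the routing is.
    """
    forward = sum(active_expert_forward_flops)
    training = sum(f * (1.0 + backward_multiplier) for f in active_expert_forward_flops)
    return training / forward

# Made-up forward FLOPs for whichever experts happened to be routed to:
print(training_to_forward_ratio([1e9, 4e9, 2.5e9]))  # -> 3.0 regardless of the mix
```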
MoEs aren’t so interesting scaling-wise, since they don’t take advantage of deep factoring. They can make sense for very high-level modules with truly separate function/data domains, where you don’t expect much overlap, but that’s a fairly limited gain, and most of the benefits of generalization come from exploiting all the deep commonality.