I’m genuinely surprised at the “brains might not be doing gradients at all” take; my understanding is they are probably doing something equivalent.
Similarly, this kind of paper points in the direction of LLMs doing something brain-like. My active expectation is that there will be a lot more papers like this in the future.
But to be clear: my overall view of the similarity of brain to DL is admittedly fueled less by these specific papers (which are nice gravy for my view, but not the actual foundation) and much more by what I see as the predictive power of hypotheses like this one, which are massively more impressive inasmuch as they were made before Transformers had been invented. Given Transformers, the comparison seems overdetermined; I wish I had seen that back in 2015.
Re. serial ops and priors: I need to pin down the comparison more, given that it’s mostly about the serial-depth thing, and I think you already get it. The base idea is that what is “simple” to mutations and what is “simple” to DL are extremely different. Fuzzily: a mutation alters protein-folding instructions and is indifferent to the “computational cost” of working this out in reality; if you tried to compute the analytic gradient for the mutation (the gradient over mutation → protein folding → different brain → different reward → competitor’s children look yummy → eat them), your computer would explode. But DL seeks only solutions that can be computed by a big ensemble of extremely shallow circuits, learned almost entirely from the data you’ve trained on. Ergo DL has very different biases: “complexity” for mutations probably has to do with instruction length, whereas “complexity” for DL is more about how far you are from whatever biases are ingrained in the data (this is still fuzzy), and the shortcut solutions DL learns are always implied by the data.
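To make the serial-depth contrast concrete, here’s a toy sketch of my own (purely illustrative, nothing from the papers above): a feedforward net spends exactly one serial step per layer on every input, whatever the problem, whereas evolution’s “cost model” (mutate, grow the organism, observe fitness) puts no cap on how much serial computation the phenotype performs.

```python
import numpy as np

# Toy illustration (my framing): a feedforward net's serial depth is fixed by
# its layer count -- every input, easy or hard, gets exactly len(weights)
# serial matrix multiplies. Gradient descent can only find solutions that fit
# inside that budget; a mutation pays no such cost.

def forward(weights, x):
    serial_steps = 0
    h = x
    for W in weights:          # one serial step per layer, nothing more
        h = np.tanh(W @ h)
        serial_steps += 1
    return h, serial_steps

rng = np.random.default_rng(0)
weights = [rng.normal(size=(64, 64)) for _ in range(4)]  # depth fixed at 4
_, depth = forward(weights, rng.normal(size=64))
print(depth)  # 4, regardless of what problem the net was trained on
```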
So when you try to transfer intuitions about the “kind of solution” you get from evolution (which ignores this serial-depth cost) to DL (which is enormously shaped by this serial-depth cost), the intuition breaks. As far as I can tell, that’s why we have this immense search for mesa-optimizers and stuff, which seems like it’s mostly barking up the wrong tree to me. I dunno; I’d refine this more, but I need to actually work.
Re. cyclic learning rates: both of us are too nervous about the theory → practice junction to make a call on how any of this transfers to useful algos (although my bet is that it won’t). But if we’re reluctant to infer from this, how much more reluctant should we be to infer from evolution?
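(For concreteness, since we keep gesturing at it: a cyclic schedule in practice just oscillates the LR between two bounds instead of decaying it monotonically. A minimal sketch using PyTorch’s built-in CyclicLR; the toy model and numbers are made up and have nothing to do with any particular experiment.)

```python
import torch

# Minimal sketch (toy model; numbers are arbitrary): the learning rate cycles
# between base_lr and max_lr via a triangular schedule rather than decaying
# monotonically.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-3, max_lr=1e-1, step_size_up=50, mode="triangular")

for step in range(200):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()   # advance the cyclic schedule once per optimiser step
```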
Mm, thanks for those resource links! OK, I think we’re mostly on the same page about what particulars can and can’t be said about these analogies at this point. I conclude that both ‘mutation+selection’ and ‘brain’ remain useful, having both is better than having only one, and care needs to be taken in any case!
As I said,
I also wouldn’t be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent
so I’m looking forward to reading those links.
Runtime optimisation/search and whatnot remain (broadly construed) a sensible concern from my POV, though I wouldn’t necessarily (at first) look literally inside NN weights to find them. I think it’s more likely that some scaffolding is needed, if that makes sense (I think I am somewhat idiosyncratic in this). I get fuzzy at this point and am still actively (slowly) building my picture of it; perhaps your resource links will provide me fuel here.
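Something like this is the shape I have in mind, very roughly (my own sketch; propose and score are hypothetical stand-ins for a model call and an external evaluator): the optimisation loop lives in ordinary code wrapped around the network, not inside its weights.

```python
import random
from typing import Callable, List

# Rough sketch of "search in the scaffolding rather than in the weights"
# (hypothetical stand-ins throughout): `propose` plays the role of a model
# call, `score` an external evaluator, and the search loop is plain code.

def scaffolded_search(propose: Callable[[], float],
                      score: Callable[[float], float],
                      n_candidates: int = 16) -> float:
    candidates: List[float] = [propose() for _ in range(n_candidates)]
    return max(candidates, key=score)   # the "search" happens out here

# Toy usage: propose random numbers, prefer the one closest to 0.5.
best = scaffolded_search(propose=random.random,
                         score=lambda x: -abs(x - 0.5))
print(best)
```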